feat: model unload
@@ -89,6 +89,7 @@ Configuration is a JSON file. All fields also accept environment variable overri
| `default_slot_capacity` | `1` | Initial slot count per backend used before the first `/slots` poll completes |
| `default_max_models` | `null` | Maximum concurrent models per backend (null = unlimited). Applied to backends that do not set their own `max_models`. |
| `max_queue_skip` | `0` | How many times a queued request may be bypassed by a model-affinity promotion before it is frozen at head-of-line. `0` disables reordering. |
| `model_unload_delay` | `3.0` | Seconds a backend stays sticky to its last model after all slots drain. Prevents unnecessary model swaps for follow-up requests (title generation, suggestions) that arrive shortly after the main response. `0` disables. |
| `model_limits` | `{}` | Per-model global concurrency cap across all backends (e.g. `{"my-large-model": 1}`). Use for models too large to run simultaneously due to RAM constraints. |

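The stickiness described for `model_unload_delay` can be illustrated with a minimal sketch. This is not the project's implementation: the `Backend` fields and the `is_sticky` and `can_accept` helpers are hypothetical names introduced here, and the sketch assumes a backend serves one model at a time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    """Hypothetical view of a backend's state (not the project's real type)."""
    name: str
    loaded_model: Optional[str] = None   # model the backend served last
    busy_slots: int = 0                  # slots currently processing requests
    drained_at: Optional[float] = None   # time (seconds) the last slot drained


def is_sticky(backend: Backend, now: float, model_unload_delay: float = 3.0) -> bool:
    """Return True while an idle backend is still reserved for its last model."""
    if model_unload_delay <= 0:
        return False  # a delay of 0 disables stickiness
    if backend.busy_slots > 0 or backend.loaded_model is None:
        return False  # only a fully drained backend with a model can be sticky
    if backend.drained_at is None:
        return False
    return (now - backend.drained_at) < model_unload_delay


def can_accept(backend: Backend, model: str, now: float,
               model_unload_delay: float = 3.0) -> bool:
    """Decide whether a request for `model` may be routed to `backend`."""
    if backend.loaded_model in (None, model):
        return True   # no swap needed
    if backend.busy_slots > 0:
        return False  # assumption: one model at a time, so no swap while busy
    return not is_sticky(backend, now, model_unload_delay)
```

With a backend that drained at `t=100.0` after serving `chat`, a request for another model is refused at `t=101.0` and accepted at `t=103.5`, while follow-up requests for `chat` are accepted throughout.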
### Per-backend fields