feat: model unload

2026-05-18 00:34:27 +02:00
parent bcebaf0e93
commit 13fb341354
6 changed files with 86 additions and 1 deletion
--- a/README.md
+++ b/README.md
@@ -89,6 +89,7 @@ Configuration is a JSON file. All fields also accept environment variable overri
 | `default_slot_capacity` | `1` | Initial slot count per backend used before the first `/slots` poll completes |
 | `default_max_models` | `null` | Maximum concurrent models per backend (null = unlimited). Applied to backends that do not set their own `max_models`. |
 | `max_queue_skip` | `0` | How many times a queued request may be bypassed by a model-affinity promotion before it is frozen at head-of-line. `0` disables reordering. |
+| `model_unload_delay` | `3.0` | Seconds a backend stays sticky to its last model after all slots drain. Prevents unnecessary model swaps for follow-up requests (title generation, suggestions) that arrive shortly after the main response. `0` disables. |
 | `model_limits` | `{}` | Per-model global concurrency cap across all backends (e.g. `{"my-large-model": 1}`). Use for models too large to run simultaneously due to RAM constraints. |

 ### Per-backend fields