# llamacpp-ha

Smart load balancer for [llama.cpp](https://github.com/ggerganov/llama.cpp) servers. Presents a single OpenAI-compatible API endpoint while distributing inference requests across multiple backends with slot-aware scheduling, session affinity, model-affinity reordering, and a live monitor page.

## Features

- **OpenAI-compatible API**: drop-in replacement for any client using `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models`, etc.
- **Slot-aware scheduling**: tracks each backend's inference slot capacity (from `/slots`) and queues requests instead of sending them to overloaded backends
- **Preemption prevention**: optional `max_models` limit per backend prevents llama.cpp from evicting a running model's KV cache when a new model request arrives
- **Model-affinity reordering**: queued requests for an already-loaded model can be promoted ahead of waiting requests (configurable with starvation protection)
- **Session affinity**: routes follow-up turns in a conversation back to the same backend, improving KV-cache reuse
- **Round-robin policy**: distributes load evenly across backends per model
- **Backend health polling**: continuously polls `/health` and `/v1/models`; removes dead backends from rotation automatically; resets slot counters on recovery
- **Backend failover**: if a backend becomes unreachable mid-request, the proxy automatically retries on another live backend with a free slot; catch-all (web UI) paths also retry across all live backends
- **Request queue**: FIFO queue with configurable timeout; returns 503 with a JSON error body when no slot becomes available in time
- **API key auth**: optional `Bearer` token validation; the proxy rewrites outbound auth to match each backend's own key
- **Monitor page**: self-contained HTML dashboard at `/monitor` (no CDN dependencies); auto-refreshes every 3 seconds
- **Catch-all proxy**: non-inference paths are forwarded best-effort to a live backend

## Requirements

- Python 3.13+
- llama.cpp server(s) with the `/slots` endpoint enabled (`--slots`)

## Installation

```bash
pip install .
# or in editable mode for development:
pip install -e ".[test]"
```

## Quick start

Copy the example config, edit it, and start the proxy:

```bash
cp config.json.example config.json
llamacpp-ha --config config.json
```

The proxy starts on `http://0.0.0.0:8080` by default.

## Configuration

Configuration is a JSON file. All fields also accept environment variable overrides with the `LLAMACPP_HA_` prefix (nested fields use `__` as delimiter).

```json
{
  "host": "0.0.0.0",
  "port": 8080,
  "api_keys": ["your-secret-key"],
  "poll_interval": 5,
  "slot_wait_timeout": 30,
  "session_idle_ttl": 300,
  "default_slot_capacity": 1,
  "default_max_models": 1,
  "max_queue_skip": 4,
  "model_limits": {
    "my-large-model": 1
  },
  "backends": [
    {
      "url": "http://localhost:8081",
      "api_key": null,
      "model_ids": [],
      "max_models": 1
    },
    {
      "url": "http://localhost:8082",
      "api_key": "backend-secret",
      "model_ids": ["llama3"],
      "max_models": null
    }
  ]
}
```

### Global fields

| Field | Default | Description |
|---|---|---|
| `host` | `0.0.0.0` | Listen address |
| `port` | `8080` | Listen port |
| `api_keys` | `[]` | Accepted bearer tokens. Empty = no auth. |
| `poll_interval` | `5.0` | Seconds between backend health polls |
| `slot_wait_timeout` | `30.0` | Max seconds a request waits for a free slot |
| `session_idle_ttl` | `300.0` | Seconds before an idle session is evicted |
| `default_slot_capacity` | `1` | Initial slot count per backend used before the first `/slots` poll completes |
| `default_max_models` | `null` | Maximum concurrent models per backend (null = unlimited). Applied to backends that do not set their own `max_models`. |
| `max_queue_skip` | `0` | How many times a queued request may be bypassed by a model-affinity promotion before it is frozen at head-of-line. `0` disables reordering. |
| `model_unload_delay` | `3.0` | Seconds a backend stays sticky to its last model after all slots drain. Prevents unnecessary model swaps for follow-up requests (title generation, suggestions) that arrive shortly after the main response. `0` disables. |
| `model_limits` | `{}` | Per-model global concurrency cap across all backends (e.g. `{"my-large-model": 1}`). Use for models too large to run simultaneously due to RAM constraints. |

### Per-backend fields

| Field | Default | Description |
|---|---|---|
| `url` | required | Backend base URL |
| `api_key` | `null` | Injected as `Authorization: Bearer <api_key>` on outbound requests; client key is stripped |
| `model_ids` | `[]` | Override the model list instead of polling `/v1/models` |
| `max_models` | `null` | Maximum concurrent distinct models on this backend. Overrides `default_max_models`. `null` = unlimited. |

### Environment variable overrides

```bash
LLAMACPP_HA_PORT=9090 llamacpp-ha --config config.json
LLAMACPP_HA_API_KEYS='["key1","key2"]' llamacpp-ha --config config.json
```

## Preemption prevention (`max_models`)

llama.cpp evicts the current model's KV cache when a different model is loaded, which can interrupt an in-flight request. Setting `max_models: 1` on a backend tells the proxy to block requests for a second model until all slots for the first model are released.

```json
{
  "default_max_models": 1,
  "backends": [
    {"url": "http://localhost:8081"},
    {"url": "http://localhost:8082"}
  ]
}
```

With this configuration each backend serves exactly one model at a time. When both backends are busy with different models, a third request waits in the queue until a slot is freed on a compatible backend.

Set `default_max_models: 2` (or higher) to allow two models to share a backend's slots simultaneously when hardware permits.

## Global model concurrency limit (`model_limits`)

Some models are large enough that running two instances simultaneously would exhaust system RAM (e.g. a 70B model that needs CPU offloading). `model_limits` caps the total number of in-flight requests for a specific model across **all** backends combined.

```json
{
  "model_limits": {
    "my-70b-model": 1
  }
}
```

When the cap is reached the proxy queues further requests for that model exactly as it does when a backend's slots are full. The request is dispatched as soon as the running instance completes and releases its slot. `slot_wait_timeout` still applies: a queued request that cannot be dispatched within the timeout receives a 503.

`model_limits` is complementary to `max_models`: `max_models` prevents a single backend from loading a second model (preemption prevention), while `model_limits` limits how many requests for one model can be active across the entire cluster at once.

## Model-affinity reordering (`max_queue_skip`)

By default the queue is strict FIFO. Setting `max_queue_skip` to a positive integer enables a two-phase dispatch:

1. **Affinity pass**: scans the queue for requests whose model is already active on a free backend. Matching requests are promoted and dispatched immediately, bypassing earlier entries.
2. **FIFO pass**: remaining entries are dispatched in arrival order.

Each time a request is bypassed, its `skip_count` is incremented. Once `skip_count` reaches `max_queue_skip` the entry is frozen at head-of-line and blocks the affinity pass, which prevents indefinite starvation.

```json
{
  "max_queue_skip": 4
}
```

This trades strict fairness for throughput: a warm model serves back-to-back requests without stalling for a cold-start on another model, while the `max_queue_skip` cap guarantees every request is eventually served.

## CLI reference

```
llamacpp-ha [--config PATH] [--host HOST] [--port PORT] [--log-level LEVEL]
```

`--config` defaults to `config.json` in the current directory. `--host` and `--port` override the values in the config file.
## API endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Returns `{"status":"ok"}` if at least one backend is live, 503 otherwise |
| `GET` | `/v1/models` | Aggregated model list across all live backends |
| `POST` | `/v1/chat/completions` | Slot-gated, session-aware inference (streaming supported) |
| `POST` | `/v1/completions` | Same as above |
| `POST` | `/v1/embeddings` | Slot-gated pass-through |
| `POST` | `/v1/images/*`, `/v1/audio/*` | Slot-gated pass-through |
| `*` | `/*` | Best-effort forward to any live backend |
| `GET` | `/monitor` | HTML dashboard |
| `GET` | `/monitor/data` | Dashboard data as JSON (exempt from API key auth) |

### Session affinity

The proxy assigns a session ID to every request. It is sent back via both a cookie (`x-llm-session`) and a response header (`X-Session-ID`). Clients can echo either on subsequent requests to pin their conversation to the same backend. The affinity record expires after `session_idle_ttl` seconds of inactivity.

For clients that do not echo a session identifier, the proxy attempts to recover affinity automatically by hashing the incoming messages array and comparing it against stored conversation prefixes. If a match is found, the request is routed to the backend that holds the corresponding KV-cache. The longest match (highest message index) takes precedence. In the unlikely event of a hash collision the request is sent to the wrong backend; the next request will rebalance normally.

### Streaming

SSE streaming responses (`text/event-stream`) are passed through transparently. The backend slot is held for the duration of the stream and released when the final chunk is sent.

## Monitor

Open `http://localhost:8080/monitor` in a browser. The page polls `/monitor/data` every 3 seconds and shows:

- Uptime, total inference requests served (catch-all and web UI paths are excluded), queue depth, active sessions, live backend count
- Per-backend: URL, live/dead status, models, slot usage (`acquired/total`), time since last poll
- Current queue contents with wait time and estimated token count
- Active sessions grouped by model

The monitor page and its data endpoint are exempt from API key authentication.

## Architecture

```
client
  │
  ▼
ApiKeyMiddleware
  │
  ▼
FastAPI app  (proxy.py)
  ├── GET  /v1/models       ──► BackendRegistry.get_all_models()
  ├── POST /v1/chat/...     ──► RequestQueue ──► Scheduler ──► SlotTracker
  │                                                                 │
  │                                                  BackendState ◄─┘
  │                                                       │
  │                                                  forwarder.py ──► aiohttp ──► backend
  ├── GET  /health
  ├── GET  /monitor[/data]  ──► monitor.py
  └── /*   catch-all        ──► forwarder.forward_best_effort()

BackendRegistry   polls /health + /v1/models + /slots every poll_interval;
                  calls on_backend_recovered when a dead backend comes back live
SlotTracker       asyncio.Condition per backend; acquire blocks until slot free;
                  enforces max_models (preemption prevention)
SessionStore      SHA-256 prefix hash → preferred backend URL; TTL eviction
RequestQueue      FIFO asyncio.Future queue; asyncio.Event for wakeup
Scheduler         two-phase dispatch (affinity pass + FIFO); N-skip reordering
RoundRobinPolicy  per-model atomic counter for backend selection
```

## Development

```bash
# Install with test dependencies
pip install -e ".[test]"

# Run all tests
python -m pytest

# Run a specific module
python -m pytest tests/test_scheduler.py -v
```

Tests use `unittest.IsolatedAsyncioTestCase` for async tests and Starlette's `TestClient` as a context manager for integration tests that require the full lifespan (aiohttp session, scheduler, registry).

## License

MIT