Files
llamacpp-ha/README.md
2026-05-18 00:34:27 +02:00

253 lines
12 KiB
Markdown

# llamacpp-ha
Smart load balancer for [llama.cpp](https://github.com/ggerganov/llama.cpp) servers. Presents a single OpenAI-compatible API endpoint while distributing inference requests across multiple backends with slot-aware scheduling, session affinity, model-affinity reordering, and a live monitor page.
## Features
- **OpenAI-compatible API** — drop-in replacement for any client using `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models`, etc.
- **Slot-aware scheduling** — tracks each backend's inference slot capacity (from `/slots`) and queues requests instead of sending them to overloaded backends
- **Preemption prevention** — optional `max_models` limit per backend prevents llama.cpp from evicting a running model's KV cache when a new model request arrives
- **Model-affinity reordering** — queued requests for an already-loaded model can be promoted ahead of waiting requests (configurable with starvation protection)
- **Session affinity** — routes follow-up turns in a conversation back to the same backend, improving KV-cache reuse
- **Round-robin policy** — distributes load evenly across backends per model
- **Backend health polling** — continuously polls `/health` and `/v1/models`; removes dead backends from rotation automatically; resets slot counters on recovery
- **Backend failover** — if a backend becomes unreachable mid-request, the proxy automatically retries on another live backend with a free slot; catch-all (web UI) paths also retry across all live backends
- **Request queue** — FIFO queue with configurable timeout; returns 503 with a JSON error body when no slot becomes available in time
- **API key auth** — optional `Bearer` token validation; the proxy rewrites outbound auth to match each backend's own key
- **Monitor page** — self-contained HTML dashboard at `/monitor` (no CDN dependencies); auto-refreshes every 3 seconds
- **Catch-all proxy** — non-inference paths are forwarded best-effort to a live backend
## Requirements
- Python 3.13+
- llama.cpp server(s) with `--slots` endpoint enabled
## Installation
```bash
pip install .
# or in editable mode for development:
pip install -e ".[test]"
```
## Quick start
Copy the example config and edit it:
```bash
cp config.json.example config.json
llamacpp-ha --config config.json
```
The proxy starts on `http://0.0.0.0:8080` by default.
## Configuration
Configuration is a JSON file. All fields also accept environment variable overrides with the `LLAMACPP_HA_` prefix (nested fields use `__` as delimiter).
```json
{
"host": "0.0.0.0",
"port": 8080,
"api_keys": ["your-secret-key"],
"poll_interval": 5,
"slot_wait_timeout": 30,
"session_idle_ttl": 300,
"default_slot_capacity": 1,
"default_max_models": 1,
"max_queue_skip": 4,
"model_limits": {
"my-large-model": 1
},
"backends": [
{
"url": "http://localhost:8081",
"api_key": null,
"model_ids": [],
"max_models": 1
},
{
"url": "http://localhost:8082",
"api_key": "backend-secret",
"model_ids": ["llama3"],
"max_models": null
}
]
}
```
### Global fields
| Field | Default | Description |
|---|---|---|
| `host` | `0.0.0.0` | Listen address |
| `port` | `8080` | Listen port |
| `api_keys` | `[]` | Accepted bearer tokens. Empty = no auth. |
| `poll_interval` | `5.0` | Seconds between backend health polls |
| `slot_wait_timeout` | `30.0` | Max seconds a request waits for a free slot |
| `session_idle_ttl` | `300.0` | Seconds before an idle session is evicted |
| `default_slot_capacity` | `1` | Initial slot count per backend used before the first `/slots` poll completes |
| `default_max_models` | `null` | Maximum concurrent models per backend (null = unlimited). Applied to backends that do not set their own `max_models`. |
| `max_queue_skip` | `0` | How many times a queued request may be bypassed by a model-affinity promotion before it is frozen at head-of-line. `0` disables reordering. |
| `model_unload_delay` | `3.0` | Seconds a backend stays sticky to its last model after all slots drain. Prevents unnecessary model swaps for follow-up requests (title generation, suggestions) that arrive shortly after the main response. `0` disables. |
| `model_limits` | `{}` | Per-model global concurrency cap across all backends (e.g. `{"my-large-model": 1}`). Use for models too large to run simultaneously due to RAM constraints. |
### Per-backend fields
| Field | Default | Description |
|---|---|---|
| `url` | required | Backend base URL |
| `api_key` | `null` | Injected as `Authorization: Bearer <key>` on outbound requests; client key is stripped |
| `model_ids` | `[]` | Override the model list instead of polling `/v1/models` |
| `max_models` | `null` | Maximum concurrent distinct models on this backend. Overrides `default_max_models`. `null` = unlimited. |
### Environment variable overrides
```bash
LLAMACPP_HA_PORT=9090 llamacpp-ha --config config.json
LLAMACPP_HA_API_KEYS='["key1","key2"]' llamacpp-ha --config config.json
```
## Preemption prevention (`max_models`)
llama.cpp evicts the current model's KV cache when a different model is loaded, which can interrupt an in-flight request. Setting `max_models: 1` on a backend tells the proxy to block requests for a second model until all slots for the first model are released.
```json
{
"default_max_models": 1,
"backends": [
{"url": "http://localhost:8081"},
{"url": "http://localhost:8082"}
]
}
```
With this configuration each backend serves exactly one model at a time. When both backends are busy with different models, a third request waits in the queue until a slot is freed on a compatible backend.
Set `default_max_models: 2` (or higher) to allow two models to share a backend's slots simultaneously when hardware permits.
## Global model concurrency limit (`model_limits`)
Some models are large enough that running two instances simultaneously would exhaust system RAM (e.g. a 70 B model that needs CPU offloading). `model_limits` caps the total number of in-flight requests for a specific model across **all** backends combined.
```json
{
"model_limits": {
"my-70b-model": 1
}
}
```
When the cap is reached the proxy queues further requests for that model exactly as it does when a backend's slots are full. The request is dispatched as soon as the running instance completes and releases its slot. `slot_wait_timeout` still applies — a queued request that cannot be dispatched within the timeout receives a 503.
`model_limits` is complementary to `max_models`: `max_models` prevents a single backend from loading a second model (preemption prevention), while `model_limits` limits how many requests for one model can be active across the entire cluster at once.
## Model-affinity reordering (`max_queue_skip`)
By default the queue is strict FIFO. Setting `max_queue_skip` to a positive integer enables a two-phase dispatch:
1. **Affinity pass** — scans the queue for requests whose model is already active on a free backend. Matching requests are promoted and dispatched immediately, bypassing earlier entries.
2. **FIFO pass** — remaining entries are dispatched in arrival order.
Each time a request is bypassed, its `skip_count` is incremented. Once `skip_count` reaches `max_queue_skip` the entry is frozen at head-of-line and blocks the affinity pass — preventing indefinite starvation.
```json
{
"max_queue_skip": 4
}
```
This trades strict fairness for throughput: a warm model serves back-to-back requests without stalling for a cold-start on another model, while the `max_queue_skip` cap guarantees every request is eventually served.
## CLI reference
```
llamacpp-ha [--config PATH] [--host HOST] [--port PORT] [--log-level LEVEL]
```
`--config` defaults to `config.json` in the current directory. `--host` and `--port` override the values in the config file.
## API endpoints
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Returns `{"status":"ok"}` if at least one backend is live, 503 otherwise |
| `GET` | `/v1/models` | Aggregated model list across all live backends |
| `POST` | `/v1/chat/completions` | Slot-gated, session-aware inference (streaming supported) |
| `POST` | `/v1/completions` | Same as above |
| `POST` | `/v1/embeddings` | Slot-gated pass-through |
| `POST` | `/v1/images/*`, `/v1/audio/*` | Slot-gated pass-through |
| `*` | `/*` | Best-effort forward to any live backend |
| `GET` | `/monitor` | HTML dashboard |
| `GET` | `/monitor/data` | Dashboard data as JSON (exempt from API key auth) |
### Session affinity
The proxy assigns a session ID to every request. It is sent back via both a cookie (`x-llm-session`) and a response header (`X-Session-ID`). Clients can echo either on subsequent requests to pin their conversation to the same backend. The affinity record expires after `session_idle_ttl` seconds of inactivity.
For clients that do not echo a session identifier, the proxy attempts to recover affinity automatically by hashing the incoming messages array and comparing it against stored conversation prefixes. If a match is found, the request is routed to the backend that holds the corresponding KV-cache. The longest match (highest message index) takes precedence. In the unlikely event of a hash collision the request is sent to the wrong backend; the next request will rebalance normally.
### Streaming
SSE streaming responses (`text/event-stream`) are passed through transparently. The backend slot is held for the duration of the stream and released when the final chunk is sent.
## Monitor
Open `http://localhost:8080/monitor` in a browser. The page polls `/monitor/data` every 3 seconds and shows:
- Uptime, total inference requests served (catch-all and web UI paths are excluded), queue depth, active sessions, live backend count
- Per-backend: URL, live/dead status, models, slot usage (`acquired/total`), time since last poll
- Current queue contents with wait time and estimated token count
- Active sessions grouped by model
The monitor page and its data endpoint are exempt from API key authentication.
## Architecture
```
client
ApiKeyMiddleware
FastAPI app (proxy.py)
├── GET /v1/models ──► BackendRegistry.get_all_models()
├── POST /v1/chat/... ──► RequestQueue ──► Scheduler ──► SlotTracker
│ │
│ BackendState ◄─┘
│ │
│ forwarder.py ──► aiohttp ──► backend
├── GET /health
├── GET /monitor[/data] ──► monitor.py
└── /* catch-all ──► forwarder.forward_best_effort()
BackendRegistry polls /health + /v1/models + /slots every poll_interval;
calls on_backend_recovered when a dead backend comes back live
SlotTracker asyncio.Condition per backend; acquire blocks until slot free;
enforces max_models (preemption prevention)
SessionStore SHA-256 prefix hash → preferred backend URL; TTL eviction
RequestQueue FIFO asyncio.Future queue; asyncio.Event for wakeup
Scheduler two-phase dispatch (affinity pass + FIFO); N-skip reordering
RoundRobinPolicy per-model atomic counter for backend selection
```
## Development
```bash
# Install with test dependencies
pip install -e ".[test]"
# Run all tests
python -m pytest
# Run a specific module
python -m pytest tests/test_scheduler.py -v
```
Tests use `unittest.IsolatedAsyncioTestCase` for async tests and Starlette's `TestClient` as a context manager for integration tests that require the full lifespan (aiohttp session, scheduler, registry).
## License
MIT