llamacpp-ha
Smart load balancer for llama.cpp servers. Presents a single OpenAI-compatible API endpoint while distributing inference requests across multiple backends with slot-aware scheduling, session affinity, model-affinity reordering, and a live monitor page.
Features
- OpenAI-compatible API: drop-in replacement for any client using /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models, etc.
- Slot-aware scheduling: tracks each backend’s inference slot capacity (from /slots) and queues requests instead of sending them to overloaded backends
- Preemption prevention: optional max_models limit per backend prevents llama.cpp from evicting a running model’s KV cache when a new model request arrives
- Model-affinity reordering: queued requests for an already-loaded model can be promoted ahead of waiting requests (configurable with starvation protection)
- Session affinity: routes follow-up turns in a conversation back to the same backend, improving KV-cache reuse
- Round-robin policy: distributes load evenly across backends per model
- Backend health polling: continuously polls /health and /v1/models; removes dead backends from rotation automatically; resets slot counters on recovery
- Backend failover: if a backend becomes unreachable mid-request, the proxy automatically retries on another live backend with a free slot; catch-all (web UI) paths also retry across all live backends
- Request queue: FIFO queue with configurable timeout; returns 503 with a JSON error body when no slot becomes available in time
- API key auth: optional Bearer token validation; the proxy rewrites outbound auth to match each backend’s own key
- Monitor page: self-contained HTML dashboard at /monitor (no CDN dependencies); auto-refreshes every 3 seconds
- Catch-all proxy: non-inference paths are forwarded best-effort to a live backend
Requirements
- Python 3.13+
- llama.cpp server(s) with the --slots endpoint enabled
Installation
Quick start
Copy the example config and edit it:
The proxy starts on http://0.0.0.0:8080 by default.
Configuration
Configuration is a JSON file. All fields also accept environment
variable overrides with the LLAMACPP_HA_ prefix (nested
fields use __ as delimiter).
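As a rough illustration of the override convention just described, the sketch below applies prefixed environment variables on top of a loaded config dict. It is not the project's actual loader: the function name `apply_env_overrides`, the lowercasing of variable names, and the choice to JSON-decode values (falling back to the raw string) are all assumptions made for the example.

```python
import json

PREFIX = "LLAMACPP_HA_"  # prefix named in this README


def apply_env_overrides(config: dict, environ: dict) -> dict:
    """Return a copy of config with LLAMACPP_HA_* variables applied (hypothetical helper)."""
    result = json.loads(json.dumps(config))  # cheap deep copy for JSON-only data
    for name, raw in environ.items():
        if not name.startswith(PREFIX):
            continue
        # "MODEL_LIMITS__FOO" -> ["model_limits", "foo"] (lowercasing is an assumption)
        path = name[len(PREFIX):].lower().split("__")
        try:
            value = json.loads(raw)  # "9090" -> 9090, '["k1"]' -> ["k1"]
        except json.JSONDecodeError:
            value = raw  # plain strings such as "0.0.0.0"
        node = result
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return result
```

Calling `apply_env_overrides(config, os.environ)` after reading the JSON file would give environment variables the last word, which is the usual reason for offering such overrides in container deployments.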
{
  "host": "0.0.0.0",
  "port": 8080,
  "api_keys": ["your-secret-key"],
  "poll_interval": 5,
  "slot_wait_timeout": 30,
  "session_idle_ttl": 300,
  "default_slot_capacity": 1,
  "default_max_models": 1,
  "max_queue_skip": 4,
  "model_limits": {
    "my-large-model": 1
  },
  "backends": [
    {
      "url": "http://localhost:8081",
      "api_key": null,
      "model_ids": [],
      "max_models": 1
    },
    {
      "url": "http://localhost:8082",
      "api_key": "backend-secret",
      "model_ids": ["llama3"],
      "max_models": null
    }
  ]
}

Global fields
| Field | Default | Description |
|---|---|---|
| host | 0.0.0.0 | Listen address |
| port | 8080 | Listen port |
| api_keys | [] | Accepted bearer tokens. Empty = no auth. |
| poll_interval | 5.0 | Seconds between backend health polls |
| slot_wait_timeout | 30.0 | Max seconds a request waits for a free slot |
| session_idle_ttl | 300.0 | Seconds before an idle session is evicted |
| default_slot_capacity | 1 | Initial slot count per backend used before the first /slots poll completes |
| default_max_models | null | Maximum concurrent models per backend (null = unlimited). Applied to backends that do not set their own max_models. |
| max_queue_skip | 0 | How many times a queued request may be bypassed by a model-affinity promotion before it is frozen at head-of-line. 0 disables reordering. |
| model_unload_delay | 3.0 | Seconds a backend stays sticky to its last model after all slots drain. Prevents unnecessary model swaps for follow-up requests (title generation, suggestions) that arrive shortly after the main response. 0 disables. |
| model_limits | {} | Per-model global concurrency cap across all backends (e.g. {"my-large-model": 1}). Use for models too large to run simultaneously due to RAM constraints. |
Per-backend fields
| Field | Default | Description |
|---|---|---|
| url | required | Backend base URL |
| api_key | null | Injected as Authorization: Bearer <key> on outbound requests; client key is stripped |
| model_ids | [] | Override the model list instead of polling /v1/models |
| max_models | null | Maximum concurrent distinct models on this backend. Overrides default_max_models. null = unlimited. |
Environment variable overrides
LLAMACPP_HA_PORT=9090 llamacpp-ha --config config.json
LLAMACPP_HA_API_KEYS='["key1","key2"]' llamacpp-ha --config config.json

Preemption prevention (max_models)
llama.cpp evicts the current model’s KV cache when a different model
is loaded, which can interrupt an in-flight request. Setting
max_models: 1 on a backend tells the proxy to block
requests for a second model until all slots for the first model are
released.
{
  "default_max_models": 1,
  "backends": [
    {"url": "http://localhost:8081"},
    {"url": "http://localhost:8082"}
  ]
}

With this configuration each backend serves exactly one model at a time. When both backends are busy with different models, a third request waits in the queue until a slot is freed on a compatible backend.
Set default_max_models: 2 (or higher) to allow two
models to share a backend’s slots simultaneously when hardware
permits.
Global model concurrency limit (model_limits)
Some models are large enough that running two instances
simultaneously would exhaust system RAM (e.g. a 70 B model that needs
CPU offloading). model_limits caps the total number of
in-flight requests for a specific model across all
backends combined.
When the cap is reached the proxy queues further requests for that
model exactly as it does when a backend’s slots are full. The request is
dispatched as soon as the running instance completes and releases its
slot. slot_wait_timeout still applies — a queued request
that cannot be dispatched within the timeout receives a 503.
model_limits is complementary to
max_models: max_models prevents a single
backend from loading a second model (preemption prevention), while
model_limits limits how many requests for one model can be
active across the entire cluster at once.
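One way to picture the cluster-wide cap is a semaphore per limited model, acquired with a timeout that plays the role of slot_wait_timeout. The sketch below is an illustration under that assumption; `ModelLimiter` and its methods are hypothetical names, and the real proxy integrates the cap into its queue rather than using standalone semaphores.

```python
import asyncio


class ModelLimiter:
    """Hypothetical cluster-wide cap on in-flight requests per model."""

    def __init__(self, model_limits: dict[str, int]):
        # One semaphore per capped model; models not listed are unlimited.
        self._sems = {m: asyncio.Semaphore(n) for m, n in model_limits.items()}

    async def acquire(self, model: str, timeout: float) -> bool:
        """Return True when admitted, False when the wait timed out (caller answers 503)."""
        sem = self._sems.get(model)
        if sem is None:
            return True
        try:
            await asyncio.wait_for(sem.acquire(), timeout)
            return True
        except asyncio.TimeoutError:
            return False

    def release(self, model: str) -> None:
        sem = self._sems.get(model)
        if sem is not None:
            sem.release()
```

A caller would pair `acquire` with `release` in a try/finally block around the forwarded request, so a failed backend call cannot leak a permit.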
Model-affinity reordering (max_queue_skip)
By default the queue is strict FIFO. Setting
max_queue_skip to a positive integer enables a two-phase
dispatch:
- Affinity pass — scans the queue for requests whose model is already active on a free backend. Matching requests are promoted and dispatched immediately, bypassing earlier entries.
- FIFO pass — remaining entries are dispatched in arrival order.
Each time a request is bypassed, its skip_count is
incremented. Once skip_count reaches
max_queue_skip the entry is frozen at head-of-line and
blocks the affinity pass — preventing indefinite starvation.
This trades strict fairness for throughput: a warm model serves
back-to-back requests without stalling for a cold-start on another
model, while the max_queue_skip cap guarantees every
request is eventually served.
CLI reference
llamacpp-ha [--config PATH] [--host HOST] [--port PORT] [--log-level LEVEL]
--config defaults to config.json in the
current directory. --host and --port override
the values in the config file.
API endpoints
| Method | Path | Description |
|---|---|---|
| GET | /health | Returns {"status":"ok"} if at least one backend is live, 503 otherwise |
| GET | /v1/models | Aggregated model list across all live backends |
| POST | /v1/chat/completions | Slot-gated, session-aware inference (streaming supported) |
| POST | /v1/completions | Same as above |
| POST | /v1/embeddings | Slot-gated pass-through |
| POST | /v1/images/*, /v1/audio/* | Slot-gated pass-through |
| * | /* | Best-effort forward to any live backend |
| GET | /monitor | HTML dashboard |
| GET | /monitor/data | Dashboard data as JSON (exempt from API key auth) |
Session affinity
The proxy assigns a session ID to every request. It is sent back via
both a cookie (x-llm-session) and a response header
(X-Session-ID). Clients can echo either on subsequent
requests to pin their conversation to the same backend. The affinity
record expires after session_idle_ttl seconds of
inactivity.
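From the client side, echoing the session could look like the sketch below, which builds a chat request with the standard library and carries the X-Session-ID value from a previous response. The helper names are hypothetical, and the sketch assumes the proxy reads the same header name on requests that it sends on responses.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:8080"  # default listen port from this README


def build_chat_request(messages, model, api_key=None, session_id=None):
    """Build (but do not send) a POST to /v1/chat/completions."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    if session_id:
        headers["X-Session-ID"] = session_id  # echo the value from an earlier response
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{PROXY_URL}/v1/chat/completions", data=body, headers=headers, method="POST"
    )


def session_id_from(response_headers):
    """Read the session ID the proxy attached to a response, if any."""
    return response_headers.get("X-Session-ID")
```

In use, a client would send the first request with `urllib.request.urlopen(req)`, pass `resp.headers` to `session_id_from`, and supply the result as `session_id` on every later turn.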
For clients that do not echo a session identifier, the proxy attempts to recover affinity automatically by hashing the incoming messages array and comparing it against stored conversation prefixes. If a match is found, the request is routed to the backend that holds the corresponding KV-cache. The longest match (highest message index) takes precedence. In the unlikely event of a hash collision the request is sent to the wrong backend; the next request will rebalance normally.
Streaming
SSE streaming responses (text/event-stream) are passed
through transparently. The backend slot is held for the duration of the
stream and released when the final chunk is sent.
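Holding a slot for the lifetime of a stream is commonly done with an async generator and a try/finally block. The sketch below shows that pattern with an asyncio.Semaphore standing in for the backend slot; `relay_stream` is a hypothetical name and the real forwarder may be structured differently.

```python
import asyncio
from typing import AsyncIterator


async def relay_stream(chunks: AsyncIterator[bytes], slot: asyncio.Semaphore) -> AsyncIterator[bytes]:
    """Pass chunks through unchanged while holding one slot (hypothetical helper)."""
    await slot.acquire()
    try:
        async for chunk in chunks:
            yield chunk  # forwarded as-is, e.g. b"data: {...}\n\n"
    finally:
        # Runs after the final chunk, on backend errors, and when the generator is closed.
        slot.release()
```

The finally clause is the important part: without it, a client that disconnects mid-stream would leave the slot marked busy until the process restarts.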
Monitor
Open http://localhost:8080/monitor in a browser. The
page polls /monitor/data every 3 seconds and shows:
- Uptime, total inference requests served (catch-all and web UI paths are excluded), queue depth, active sessions, live backend count
- Per-backend: URL, live/dead status, models, slot usage (acquired/total), time since last poll
- Current queue contents with wait time and estimated token count
- Active sessions grouped by model
The monitor page and its data endpoint are exempt from API key authentication.
Architecture
client
  │
  ▼
ApiKeyMiddleware
  │
  ▼
FastAPI app (proxy.py)
  ├── GET  /v1/models      ──► BackendRegistry.get_all_models()
  ├── POST /v1/chat/...    ──► RequestQueue ──► Scheduler ──► SlotTracker
  │                                                              │
  │                                               BackendState ◄─┘
  │                                                    │
  │                                         forwarder.py ──► aiohttp ──► backend
  ├── GET  /health
  ├── GET  /monitor[/data] ──► monitor.py
  └── /*   catch-all       ──► forwarder.forward_best_effort()

BackendRegistry   polls /health + /v1/models + /slots every poll_interval;
                  calls on_backend_recovered when a dead backend comes back live
SlotTracker       asyncio.Condition per backend; acquire blocks until slot free;
                  enforces max_models (preemption prevention)
SessionStore      SHA-256 prefix hash → preferred backend URL; TTL eviction
RequestQueue      FIFO asyncio.Future queue; asyncio.Event for wakeup
Scheduler         two-phase dispatch (affinity pass + FIFO); N-skip reordering
RoundRobinPolicy  per-model atomic counter for backend selection
Development
# Install with test dependencies
pip install -e ".[test]"
# Run all tests
python -m pytest
# Run a specific module
python -m pytest tests/test_scheduler.py -v

Tests use unittest.IsolatedAsyncioTestCase for async tests and Starlette’s TestClient as a context manager for integration tests that require the full lifespan (aiohttp session, scheduler, registry).
License
MIT