Smart load balancer for llama.cpp servers. Presents a single OpenAI-compatible API endpoint while distributing inference requests across multiple backends with slot-aware scheduling, session affinity, model-affinity reordering, and a live monitor page.

Features

OpenAI-compatible API — drop-in replacement for any client using /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models, etc.
Slot-aware scheduling — tracks each backend’s inference slot capacity (from /slots) and queues requests instead of sending them to overloaded backends
Preemption prevention — optional max_models limit per backend prevents llama.cpp from evicting a running model’s KV cache when a new model request arrives
Model-affinity reordering — queued requests for an already-loaded model can be promoted ahead of waiting requests (configurable with starvation protection)
Session affinity — routes follow-up turns in a conversation back to the same backend, improving KV-cache reuse
Round-robin policy — distributes load evenly across backends per model
Backend health polling — continuously polls /health and /v1/models; removes dead backends from rotation automatically; resets slot counters on recovery
Backend failover — if a backend becomes unreachable mid-request, the proxy automatically retries on another live backend with a free slot; catch-all (web UI) paths also retry across all live backends
Request queue — FIFO queue with configurable timeout; returns 503 with a JSON error body when no slot becomes available in time
API key auth — optional Bearer token validation; the proxy rewrites outbound auth to match each backend’s own key
Monitor page — self-contained HTML dashboard at /monitor (no CDN dependencies); auto-refreshes every 3 seconds
Catch-all proxy — non-inference paths are forwarded best-effort to a live backend

Requirements

Python 3.13+
llama.cpp server(s) with --slots endpoint enabled

Installation

pip install .
# or in editable mode for development:
pip install -e ".[test]"

Quick start

Copy the example config and edit it:

cp config.json.example config.json
llamacpp-ha --config config.json

The proxy starts on http://0.0.0.0:8080 by default.

Configuration

Configuration is a JSON file. All fields also accept environment variable overrides with the LLAMACPP_HA_ prefix (nested fields use __ as delimiter).

{
  "host": "0.0.0.0",
  "port": 8080,
  "api_keys": ["your-secret-key"],
  "poll_interval": 5,
  "slot_wait_timeout": 30,
  "session_idle_ttl": 300,
  "default_slot_capacity": 1,
  "default_max_models": 1,
  "max_queue_skip": 4,
  "model_limits": {
    "my-large-model": 1
  },
  "backends": [
    {
      "url": "http://localhost:8081",
      "api_key": null,
      "model_ids": [],
      "max_models": 1
    },
    {
      "url": "http://localhost:8082",
      "api_key": "backend-secret",
      "model_ids": ["llama3"],
      "max_models": null
    }
  ]
}

Global fields

Field	Default	Description
`host`	`0.0.0.0`	Listen address
`port`	`8080`	Listen port
`api_keys`	`[]`	Accepted bearer tokens. Empty = no auth.
`poll_interval`	`5.0`	Seconds between backend health polls
`slot_wait_timeout`	`30.0`	Max seconds a request waits for a free slot
`session_idle_ttl`	`300.0`	Seconds before an idle session is evicted
`default_slot_capacity`	`1`	Initial slot count per backend used before the first `/slots` poll completes
`default_max_models`	`null`	Maximum concurrent models per backend (null = unlimited). Applied to backends that do not set their own `max_models`.
`max_queue_skip`	`0`	How many times a queued request may be bypassed by a model-affinity promotion before it is frozen at head-of-line. `0` disables reordering.
`model_unload_delay`	`3.0`	Seconds a backend stays sticky to its last model after all slots drain. Prevents unnecessary model swaps for follow-up requests (title generation, suggestions) that arrive shortly after the main response. `0` disables.
`model_limits`	`{}`	Per-model global concurrency cap across all backends (e.g. `{"my-large-model": 1}`). Use for models too large to run simultaneously due to RAM constraints.

Per-backend fields

Field	Default	Description
`url`	required	Backend base URL
`api_key`	`null`	Injected as `Authorization: Bearer <key>` on outbound requests; client key is stripped
`model_ids`	`[]`	Override the model list instead of polling `/v1/models`
`max_models`	`null`	Maximum concurrent distinct models on this backend. Overrides `default_max_models`. `null` = unlimited.

Environment variable overrides

LLAMACPP_HA_PORT=9090 llamacpp-ha --config config.json
LLAMACPP_HA_API_KEYS='["key1","key2"]' llamacpp-ha --config config.json

Preemption prevention (`max_models`)

llama.cpp evicts the current model’s KV cache when a different model is loaded, which can interrupt an in-flight request. Setting max_models: 1 on a backend tells the proxy to block requests for a second model until all slots for the first model are released.

{
  "default_max_models": 1,
  "backends": [
    {"url": "http://localhost:8081"},
    {"url": "http://localhost:8082"}
  ]
}

With this configuration each backend serves exactly one model at a time. When both backends are busy with different models, a third request waits in the queue until a slot is freed on a compatible backend.

Set default_max_models: 2 (or higher) to allow two models to share a backend’s slots simultaneously when hardware permits.

Global model concurrency limit (`model_limits`)

Some models are large enough that running two instances simultaneously would exhaust system RAM (e.g. a 70 B model that needs CPU offloading). model_limits caps the total number of in-flight requests for a specific model across all backends combined.

{
  "model_limits": {
    "my-70b-model": 1
  }
}

When the cap is reached the proxy queues further requests for that model exactly as it does when a backend’s slots are full. The request is dispatched as soon as the running instance completes and releases its slot. slot_wait_timeout still applies — a queued request that cannot be dispatched within the timeout receives a 503.

model_limits is complementary to max_models: max_models prevents a single backend from loading a second model (preemption prevention), while model_limits limits how many requests for one model can be active across the entire cluster at once.

Model-affinity reordering (`max_queue_skip`)

By default the queue is strict FIFO. Setting max_queue_skip to a positive integer enables a two-phase dispatch:

Affinity pass — scans the queue for requests whose model is already active on a free backend. Matching requests are promoted and dispatched immediately, bypassing earlier entries.
FIFO pass — remaining entries are dispatched in arrival order.

Each time a request is bypassed, its skip_count is incremented. Once skip_count reaches max_queue_skip the entry is frozen at head-of-line and blocks the affinity pass — preventing indefinite starvation.

{
  "max_queue_skip": 4
}

This trades strict fairness for throughput: a warm model serves back-to-back requests without stalling for a cold-start on another model, while the max_queue_skip cap guarantees every request is eventually served.

CLI reference

llamacpp-ha [--config PATH] [--host HOST] [--port PORT] [--log-level LEVEL]

--config defaults to config.json in the current directory. --host and --port override the values in the config file.

API endpoints

Method	Path	Description
`GET`	`/health`	Returns `{"status":"ok"}` if at least one backend is live, 503 otherwise
`GET`	`/v1/models`	Aggregated model list across all live backends
`POST`	`/v1/chat/completions`	Slot-gated, session-aware inference (streaming supported)
`POST`	`/v1/completions`	Same as above
`POST`	`/v1/embeddings`	Slot-gated pass-through
`POST`	`/v1/images/`, `/v1/audio/`	Slot-gated pass-through
`*`	`/*`	Best-effort forward to any live backend
`GET`	`/monitor`	HTML dashboard
`GET`	`/monitor/data`	Dashboard data as JSON (exempt from API key auth)

Session affinity

The proxy assigns a session ID to every request. It is sent back via both a cookie (x-llm-session) and a response header (X-Session-ID). Clients can echo either on subsequent requests to pin their conversation to the same backend. The affinity record expires after session_idle_ttl seconds of inactivity.

For clients that do not echo a session identifier, the proxy attempts to recover affinity automatically by hashing the incoming messages array and comparing it against stored conversation prefixes. If a match is found, the request is routed to the backend that holds the corresponding KV-cache. The longest match (highest message index) takes precedence. In the unlikely event of a hash collision the request is sent to the wrong backend; the next request will rebalance normally.

Streaming

SSE streaming responses (text/event-stream) are passed through transparently. The backend slot is held for the duration of the stream and released when the final chunk is sent.

Monitor

Open http://localhost:8080/monitor in a browser. The page polls /monitor/data every 3 seconds and shows:

Uptime, total inference requests served (catch-all and web UI paths are excluded), queue depth, active sessions, live backend count
Per-backend: URL, live/dead status, models, slot usage (acquired/total), time since last poll
Current queue contents with wait time and estimated token count
Active sessions grouped by model

The monitor page and its data endpoint are exempt from API key authentication.

Architecture

client
  │
  ▼
ApiKeyMiddleware
  │
  ▼
FastAPI app (proxy.py)
  ├── GET /v1/models         ──► BackendRegistry.get_all_models()
  ├── POST /v1/chat/...      ──► RequestQueue ──► Scheduler ──► SlotTracker
  │                                                              │
  │                                               BackendState ◄─┘
  │                                                    │
  │                                               forwarder.py ──► aiohttp ──► backend
  ├── GET /health
  ├── GET /monitor[/data]    ──► monitor.py
  └── /* catch-all           ──► forwarder.forward_best_effort()

BackendRegistry              polls /health + /v1/models + /slots every poll_interval;
                             calls on_backend_recovered when a dead backend comes back live
SlotTracker                  asyncio.Condition per backend; acquire blocks until slot free;
                             enforces max_models (preemption prevention)
SessionStore                 SHA-256 prefix hash → preferred backend URL; TTL eviction
RequestQueue                 FIFO asyncio.Future queue; asyncio.Event for wakeup
Scheduler                    two-phase dispatch (affinity pass + FIFO); N-skip reordering
RoundRobinPolicy             per-model atomic counter for backend selection

Development

# Install with test dependencies
pip install -e ".[test]"

# Run all tests
python -m pytest

# Run a specific module
python -m pytest tests/test_scheduler.py -v

Tests use unittest.IsolatedAsyncioTestCase for async tests and Starlette’s TestClient as a context manager for integration tests that require the full lifespan (aiohttp session, scheduler, registry).

License

MIT

README.md Unescape Escape

llamacpp-ha