2026-05-18 01:02:57 +02:00
2026-05-18 01:02:57 +02:00
2026-05-18 01:02:57 +02:00
2026-05-18 00:34:27 +02:00
2026-05-17 09:54:18 +02:00
2026-05-17 09:54:18 +02:00
2026-05-17 09:54:18 +02:00
2026-05-18 00:34:27 +02:00
2026-05-17 23:20:04 +02:00
2026-05-18 00:43:53 +02:00
2026-05-18 00:34:27 +02:00

llamacpp-ha

Smart load balancer for llama.cpp servers. Presents a single OpenAI-compatible API endpoint while distributing inference requests across multiple backends with slot-aware scheduling, session affinity, model-affinity reordering, and a live monitor page.

Features

  • OpenAI-compatible API — drop-in replacement for any client using /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models, etc.
  • Slot-aware scheduling — tracks each backends inference slot capacity (from /slots) and queues requests instead of sending them to overloaded backends
  • Preemption prevention — optional max_models limit per backend prevents llama.cpp from evicting a running models KV cache when a new model request arrives
  • Model-affinity reordering — queued requests for an already-loaded model can be promoted ahead of waiting requests (configurable with starvation protection)
  • Session affinity — routes follow-up turns in a conversation back to the same backend, improving KV-cache reuse
  • Round-robin policy — distributes load evenly across backends per model
  • Backend health polling — continuously polls /health and /v1/models; removes dead backends from rotation automatically; resets slot counters on recovery
  • Backend failover — if a backend becomes unreachable mid-request, the proxy automatically retries on another live backend with a free slot; catch-all (web UI) paths also retry across all live backends
  • Request queue — FIFO queue with configurable timeout; returns 503 with a JSON error body when no slot becomes available in time
  • API key auth — optional Bearer token validation; the proxy rewrites outbound auth to match each backends own key
  • Monitor page — self-contained HTML dashboard at /monitor (no CDN dependencies); auto-refreshes every 3 seconds
  • Catch-all proxy — non-inference paths are forwarded best-effort to a live backend

Requirements

  • Python 3.13+
  • llama.cpp server(s) with --slots endpoint enabled

Installation

pip install .
# or in editable mode for development:
pip install -e ".[test]"

Quick start

Copy the example config and edit it:

cp config.json.example config.json
llamacpp-ha --config config.json

The proxy starts on http://0.0.0.0:8080 by default.

Configuration

Configuration is a JSON file. All fields also accept environment variable overrides with the LLAMACPP_HA_ prefix (nested fields use __ as delimiter).

{
  "host": "0.0.0.0",
  "port": 8080,
  "api_keys": ["your-secret-key"],
  "poll_interval": 5,
  "slot_wait_timeout": 30,
  "session_idle_ttl": 300,
  "default_slot_capacity": 1,
  "default_max_models": 1,
  "max_queue_skip": 4,
  "model_limits": {
    "my-large-model": 1
  },
  "backends": [
    {
      "url": "http://localhost:8081",
      "api_key": null,
      "model_ids": [],
      "max_models": 1
    },
    {
      "url": "http://localhost:8082",
      "api_key": "backend-secret",
      "model_ids": ["llama3"],
      "max_models": null
    }
  ]
}

Global fields

Field Default Description
host 0.0.0.0 Listen address
port 8080 Listen port
api_keys [] Accepted bearer tokens. Empty = no auth.
poll_interval 5.0 Seconds between backend health polls
slot_wait_timeout 30.0 Max seconds a request waits for a free slot
session_idle_ttl 300.0 Seconds before an idle session is evicted
default_slot_capacity 1 Initial slot count per backend used before the first /slots poll completes
default_max_models null Maximum concurrent models per backend (null = unlimited). Applied to backends that do not set their own max_models.
max_queue_skip 0 How many times a queued request may be bypassed by a model-affinity promotion before it is frozen at head-of-line. 0 disables reordering.
model_unload_delay 3.0 Seconds a backend stays sticky to its last model after all slots drain. Prevents unnecessary model swaps for follow-up requests (title generation, suggestions) that arrive shortly after the main response. 0 disables.
model_limits {} Per-model global concurrency cap across all backends (e.g. {"my-large-model": 1}). Use for models too large to run simultaneously due to RAM constraints.

Per-backend fields

Field Default Description
url required Backend base URL
api_key null Injected as Authorization: Bearer <key> on outbound requests; client key is stripped
model_ids [] Override the model list instead of polling /v1/models
max_models null Maximum concurrent distinct models on this backend. Overrides default_max_models. null = unlimited.

Environment variable overrides

LLAMACPP_HA_PORT=9090 llamacpp-ha --config config.json
LLAMACPP_HA_API_KEYS='["key1","key2"]' llamacpp-ha --config config.json

Preemption prevention (max_models)

llama.cpp evicts the current models KV cache when a different model is loaded, which can interrupt an in-flight request. Setting max_models: 1 on a backend tells the proxy to block requests for a second model until all slots for the first model are released.

{
  "default_max_models": 1,
  "backends": [
    {"url": "http://localhost:8081"},
    {"url": "http://localhost:8082"}
  ]
}

With this configuration each backend serves exactly one model at a time. When both backends are busy with different models, a third request waits in the queue until a slot is freed on a compatible backend.

Set default_max_models: 2 (or higher) to allow two models to share a backends slots simultaneously when hardware permits.

Global model concurrency limit (model_limits)

Some models are large enough that running two instances simultaneously would exhaust system RAM (e.g. a 70 B model that needs CPU offloading). model_limits caps the total number of in-flight requests for a specific model across all backends combined.

{
  "model_limits": {
    "my-70b-model": 1
  }
}

When the cap is reached the proxy queues further requests for that model exactly as it does when a backends slots are full. The request is dispatched as soon as the running instance completes and releases its slot. slot_wait_timeout still applies — a queued request that cannot be dispatched within the timeout receives a 503.

model_limits is complementary to max_models: max_models prevents a single backend from loading a second model (preemption prevention), while model_limits limits how many requests for one model can be active across the entire cluster at once.

Model-affinity reordering (max_queue_skip)

By default the queue is strict FIFO. Setting max_queue_skip to a positive integer enables a two-phase dispatch:

  1. Affinity pass — scans the queue for requests whose model is already active on a free backend. Matching requests are promoted and dispatched immediately, bypassing earlier entries.
  2. FIFO pass — remaining entries are dispatched in arrival order.

Each time a request is bypassed, its skip_count is incremented. Once skip_count reaches max_queue_skip the entry is frozen at head-of-line and blocks the affinity pass — preventing indefinite starvation.

{
  "max_queue_skip": 4
}

This trades strict fairness for throughput: a warm model serves back-to-back requests without stalling for a cold-start on another model, while the max_queue_skip cap guarantees every request is eventually served.

CLI reference

llamacpp-ha [--config PATH] [--host HOST] [--port PORT] [--log-level LEVEL]

--config defaults to config.json in the current directory. --host and --port override the values in the config file.

API endpoints

Method Path Description
GET /health Returns {"status":"ok"} if at least one backend is live, 503 otherwise
GET /v1/models Aggregated model list across all live backends
POST /v1/chat/completions Slot-gated, session-aware inference (streaming supported)
POST /v1/completions Same as above
POST /v1/embeddings Slot-gated pass-through
POST /v1/images/*, /v1/audio/* Slot-gated pass-through
* /* Best-effort forward to any live backend
GET /monitor HTML dashboard
GET /monitor/data Dashboard data as JSON (exempt from API key auth)

Session affinity

The proxy assigns a session ID to every request. It is sent back via both a cookie (x-llm-session) and a response header (X-Session-ID). Clients can echo either on subsequent requests to pin their conversation to the same backend. The affinity record expires after session_idle_ttl seconds of inactivity.

For clients that do not echo a session identifier, the proxy attempts to recover affinity automatically by hashing the incoming messages array and comparing it against stored conversation prefixes. If a match is found, the request is routed to the backend that holds the corresponding KV-cache. The longest match (highest message index) takes precedence. In the unlikely event of a hash collision the request is sent to the wrong backend; the next request will rebalance normally.

Streaming

SSE streaming responses (text/event-stream) are passed through transparently. The backend slot is held for the duration of the stream and released when the final chunk is sent.

Monitor

Open http://localhost:8080/monitor in a browser. The page polls /monitor/data every 3 seconds and shows:

  • Uptime, total inference requests served (catch-all and web UI paths are excluded), queue depth, active sessions, live backend count
  • Per-backend: URL, live/dead status, models, slot usage (acquired/total), time since last poll
  • Current queue contents with wait time and estimated token count
  • Active sessions grouped by model

The monitor page and its data endpoint are exempt from API key authentication.

Architecture

client
  │
  ▼
ApiKeyMiddleware
  │
  ▼
FastAPI app (proxy.py)
  ├── GET /v1/models         ──► BackendRegistry.get_all_models()
  ├── POST /v1/chat/...      ──► RequestQueue ──► Scheduler ──► SlotTracker
  │                                                              │
  │                                               BackendState ◄─┘
  │                                                    │
  │                                               forwarder.py ──► aiohttp ──► backend
  ├── GET /health
  ├── GET /monitor[/data]    ──► monitor.py
  └── /* catch-all           ──► forwarder.forward_best_effort()

BackendRegistry              polls /health + /v1/models + /slots every poll_interval;
                             calls on_backend_recovered when a dead backend comes back live
SlotTracker                  asyncio.Condition per backend; acquire blocks until slot free;
                             enforces max_models (preemption prevention)
SessionStore                 SHA-256 prefix hash → preferred backend URL; TTL eviction
RequestQueue                 FIFO asyncio.Future queue; asyncio.Event for wakeup
Scheduler                    two-phase dispatch (affinity pass + FIFO); N-skip reordering
RoundRobinPolicy             per-model atomic counter for backend selection

Development

# Install with test dependencies
pip install -e ".[test]"

# Run all tests
python -m pytest

# Run a specific module
python -m pytest tests/test_scheduler.py -v

Tests use unittest.IsolatedAsyncioTestCase for async tests and Starlettes TestClient as a context manager for integration tests that require the full lifespan (aiohttp session, scheduler, registry).

License

MIT

Description
llamacpp high availability and load balancer
Readme 218 KiB
Languages
Python 100%