llamacpp-ha/README.md

# llamacpp-ha

Smart load balancer for [llama.cpp](https://github.com/ggerganov/llama.cpp) servers. Presents a single OpenAI-compatible API endpoint while distributing inference requests across multiple backends with slot-aware scheduling, session affinity, model-affinity reordering, and a live monitor page.

## Features

- **OpenAI-compatible API** — drop-in replacement for any client using `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models`, etc.
- **Slot-aware scheduling** — tracks each backend's inference slot capacity (from `/slots`) and queues requests instead of sending them to overloaded backends
- **Preemption prevention** — optional `max_models` limit per backend prevents llama.cpp from evicting a running model's KV cache when a new model request arrives
- **Model-affinity reordering** — queued requests for an already-loaded model can be promoted ahead of waiting requests (configurable with starvation protection)
- **Session affinity** — routes follow-up turns in a conversation back to the same backend, improving KV-cache reuse
- **Round-robin policy** — distributes load evenly across backends per model
- **Backend health polling** — continuously polls `/health` and `/v1/models`; removes dead backends from rotation automatically; resets slot counters on recovery
- **Backend failover** — if a backend becomes unreachable mid-request, the proxy automatically retries on another live backend with a free slot; catch-all (web UI) paths also retry across all live backends
- **Request queue** — FIFO queue with configurable timeout; returns 503 with a JSON error body when no slot becomes available in time
- **API key auth** — optional `Bearer` token validation; the proxy rewrites outbound auth to match each backend's own key
- **Monitor page** — self-contained HTML dashboard at `/monitor` (no CDN dependencies); auto-refreshes every 3 seconds
- **Catch-all proxy** — non-inference paths are forwarded best-effort to a live backend

## Requirements

- Python 3.13+
- llama.cpp server(s) with `--slots` endpoint enabled

## Installation

```bash
pip install .
# or in editable mode for development:
pip install -e ".[test]"
```

## Quick start

Copy the example config and edit it:

```bash
cp config.json.example config.json
llamacpp-ha --config config.json
```

The proxy starts on `http://0.0.0.0:8080` by default.

## Configuration

Configuration is a JSON file. All fields also accept environment variable overrides with the `LLAMACPP_HA_` prefix (nested fields use `__` as delimiter).

```json
{
  "host": "0.0.0.0",
  "port": 8080,
  "api_keys": ["your-secret-key"],
  "poll_interval": 5,
  "slot_wait_timeout": 30,
  "session_idle_ttl": 300,
  "default_slot_capacity": 1,
  "default_max_models": 1,
  "max_queue_skip": 4,
  "model_limits": {
    "my-large-model": 1
  },
  "backends": [
    {
      "url": "http://localhost:8081",
      "api_key": null,
      "model_ids": [],
      "max_models": 1
    },
    {
      "url": "http://localhost:8082",
      "api_key": "backend-secret",
      "model_ids": ["llama3"],
      "max_models": null
    }
  ]
}
```

### Global fields

| Field | Default | Description |
|---|---|---|
| `host` | `0.0.0.0` | Listen address |
| `port` | `8080` | Listen port |
| `api_keys` | `[]` | Accepted bearer tokens. Empty = no auth. |
| `poll_interval` | `5.0` | Seconds between backend health polls |
| `slot_wait_timeout` | `30.0` | Max seconds a request waits for a free slot |
| `session_idle_ttl` | `300.0` | Seconds before an idle session is evicted |
| `default_slot_capacity` | `1` | Initial slot count per backend used before the first `/slots` poll completes |
| `default_max_models` | `null` | Maximum concurrent models per backend (null = unlimited). Applied to backends that do not set their own `max_models`. |
| `max_queue_skip` | `0` | How many times a queued request may be bypassed by a model-affinity promotion before it is frozen at head-of-line. `0` disables reordering. |
| `model_unload_delay` | `3.0` | Seconds a backend stays sticky to its last model after all slots drain. Prevents unnecessary model swaps for follow-up requests (title generation, suggestions) that arrive shortly after the main response. `0` disables. |
| `model_limits` | `{}` | Per-model global concurrency cap across all backends (e.g. `{"my-large-model": 1}`). Use for models too large to run simultaneously due to RAM constraints. |

### Per-backend fields

| Field | Default | Description |
|---|---|---|
| `url` | required | Backend base URL |
| `api_key` | `null` | Injected as `Authorization: Bearer <key>` on outbound requests; client key is stripped |
| `model_ids` | `[]` | Override the model list instead of polling `/v1/models` |
| `max_models` | `null` | Maximum concurrent distinct models on this backend. Overrides `default_max_models`. `null` = unlimited. |

### Environment variable overrides

```bash
LLAMACPP_HA_PORT=9090 llamacpp-ha --config config.json
LLAMACPP_HA_API_KEYS='["key1","key2"]' llamacpp-ha --config config.json
```

## Preemption prevention (`max_models`)

llama.cpp evicts the current model's KV cache when a different model is loaded, which can interrupt an in-flight request. Setting `max_models: 1` on a backend tells the proxy to block requests for a second model until all slots for the first model are released.

```json
{
  "default_max_models": 1,
  "backends": [
    {"url": "http://localhost:8081"},
    {"url": "http://localhost:8082"}
  ]
}
```

With this configuration each backend serves exactly one model at a time. When both backends are busy with different models, a third request waits in the queue until a slot is freed on a compatible backend.

Set `default_max_models: 2` (or higher) to allow two models to share a backend's slots simultaneously when hardware permits.

## Global model concurrency limit (`model_limits`)

Some models are large enough that running two instances simultaneously would exhaust system RAM (e.g. a 70 B model that needs CPU offloading). `model_limits` caps the total number of in-flight requests for a specific model across **all** backends combined.

```json
{
  "model_limits": {
    "my-70b-model": 1
  }
}
```

When the cap is reached the proxy queues further requests for that model exactly as it does when a backend's slots are full. The request is dispatched as soon as the running instance completes and releases its slot. `slot_wait_timeout` still applies — a queued request that cannot be dispatched within the timeout receives a 503.

`model_limits` is complementary to `max_models`: `max_models` prevents a single backend from loading a second model (preemption prevention), while `model_limits` limits how many requests for one model can be active across the entire cluster at once.

## Model-affinity reordering (`max_queue_skip`)

By default the queue is strict FIFO. Setting `max_queue_skip` to a positive integer enables a two-phase dispatch:

1. **Affinity pass** — scans the queue for requests whose model is already active on a free backend. Matching requests are promoted and dispatched immediately, bypassing earlier entries.
2. **FIFO pass** — remaining entries are dispatched in arrival order.

Each time a request is bypassed, its `skip_count` is incremented. Once `skip_count` reaches `max_queue_skip` the entry is frozen at head-of-line and blocks the affinity pass — preventing indefinite starvation.

```json
{
  "max_queue_skip": 4
}
```

This trades strict fairness for throughput: a warm model serves back-to-back requests without stalling for a cold-start on another model, while the `max_queue_skip` cap guarantees every request is eventually served.

## CLI reference

```
llamacpp-ha [--config PATH] [--host HOST] [--port PORT] [--log-level LEVEL]
```

`--config` defaults to `config.json` in the current directory. `--host` and `--port` override the values in the config file.

## API endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Returns `{"status":"ok"}` if at least one backend is live, 503 otherwise |
| `GET` | `/v1/models` | Aggregated model list across all live backends |
| `POST` | `/v1/chat/completions` | Slot-gated, session-aware inference (streaming supported) |
| `POST` | `/v1/completions` | Same as above |
| `POST` | `/v1/embeddings` | Slot-gated pass-through |
| `POST` | `/v1/images/*`, `/v1/audio/*` | Slot-gated pass-through |
| `*` | `/*` | Best-effort forward to any live backend |
| `GET` | `/monitor` | HTML dashboard |
| `GET` | `/monitor/data` | Dashboard data as JSON (exempt from API key auth) |

### Session affinity

The proxy assigns a session ID to every request. It is sent back via both a cookie (`x-llm-session`) and a response header (`X-Session-ID`). Clients can echo either on subsequent requests to pin their conversation to the same backend. The affinity record expires after `session_idle_ttl` seconds of inactivity.

For clients that do not echo a session identifier, the proxy attempts to recover affinity automatically by hashing the incoming messages array and comparing it against stored conversation prefixes. If a match is found, the request is routed to the backend that holds the corresponding KV-cache. The longest match (highest message index) takes precedence. In the unlikely event of a hash collision the request is sent to the wrong backend; the next request will rebalance normally.

### Streaming

SSE streaming responses (`text/event-stream`) are passed through transparently. The backend slot is held for the duration of the stream and released when the final chunk is sent.

## Monitor

Open `http://localhost:8080/monitor` in a browser. The page polls `/monitor/data` every 3 seconds and shows:

- Uptime, total inference requests served (catch-all and web UI paths are excluded), queue depth, active sessions, live backend count
- Per-backend: URL, live/dead status, models, slot usage (`acquired/total`), time since last poll
- Current queue contents with wait time and estimated token count
- Active sessions grouped by model

The monitor page and its data endpoint are exempt from API key authentication.

## Architecture

```
client
  │
  ▼
ApiKeyMiddleware
  │
  ▼
FastAPI app (proxy.py)
  ├── GET /v1/models         ──► BackendRegistry.get_all_models()
  ├── POST /v1/chat/...      ──► RequestQueue ──► Scheduler ──► SlotTracker
  │                                                              │
  │                                               BackendState ◄─┘
  │                                                    │
  │                                               forwarder.py ──► aiohttp ──► backend
  ├── GET /health
  ├── GET /monitor[/data]    ──► monitor.py
  └── /* catch-all           ──► forwarder.forward_best_effort()

BackendRegistry              polls /health + /v1/models + /slots every poll_interval;
                             calls on_backend_recovered when a dead backend comes back live
SlotTracker                  asyncio.Condition per backend; acquire blocks until slot free;
                             enforces max_models (preemption prevention)
SessionStore                 SHA-256 prefix hash → preferred backend URL; TTL eviction
RequestQueue                 FIFO asyncio.Future queue; asyncio.Event for wakeup
Scheduler                    two-phase dispatch (affinity pass + FIFO); N-skip reordering
RoundRobinPolicy             per-model atomic counter for backend selection
```

## Development

```bash
# Install with test dependencies
pip install -e ".[test]"

# Run all tests
python -m pytest

# Run a specific module
python -m pytest tests/test_scheduler.py -v
```

Tests use `unittest.IsolatedAsyncioTestCase` for async tests and Starlette's `TestClient` as a context manager for integration tests that require the full lifespan (aiohttp session, scheduler, registry).

## License

MIT