llamacpp-ha — Agent instructions

Project overview

llamacpp-ha is a slot-aware load balancer for llama.cpp servers. It exposes a single OpenAI-compatible HTTP API and distributes inference requests across multiple backends using a global request queue, per-backend slot tracking, and session affinity. The codebase is pure async Python 3.13: FastAPI serves the inbound API and aiohttp makes the outbound requests to backends.

Commands

# Install (editable, with test deps)
pip install -e ".[test]"

# Run all tests
python -m pytest

# Run a single test file
python -m pytest tests/test_scheduler.py -v

# Start the proxy
llamacpp-ha --config config.json

Architecture

All source lives in src/llamacpp_ha/. Module responsibilities:

Module	Responsibility
`config.py`	`BackendConfig` + `ProxyConfig` (pydantic-settings, `LLAMACPP_HA_` env prefix)
`registry.py`	Polls `/health`, `/v1/models`, `/slots` on each backend; maintains live/dead state
`slot_tracker.py`	Per-backend `asyncio.Condition`; `acquire()` blocks until a slot is free
`queue.py`	FIFO `asyncio.Future` queue; `asyncio.Event` wakeup for the scheduler
`scheduler.py`	Drains the queue, resolves futures with a chosen `BackendState`; `notify_slot_released()` triggers re-drain
`policies.py`	`RoundRobinPolicy` — per-model atomic counter over `BackendCandidate` list
`session_store.py`	SHA-256 prefix hash → preferred backend; TTL eviction
`forwarder.py`	aiohttp outbound request; streaming SSE via `iter_chunked`; releases slot in `finally`
`middleware.py`	`ApiKeyMiddleware` — `Bearer` token validation; `/monitor` and `/monitor/data` are exempt
`monitor.py`	`ProxyStats` dataclass + `build_router()` for `/monitor` (HTML) and `/monitor/data` (JSON)
`proxy.py`	`create_app()` — wires all components; registers FastAPI routes; lifespan manages tasks
`__main__.py`	CLI entry point (`llamacpp-ha` script)

Critical design invariants

Slot release must be awaited, not task-spawned. SlotTracker.release() is async and must be awaited directly. Scheduling it as a task causes a race where the scheduler checks has_free_slot() before the release runs. This applies in forwarder.py (both the streaming finally block and the error path).
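The race can be reproduced in a few lines. The sketch below uses a stand-in `SlotTracker` built on `asyncio.Condition`; only `acquire()`, `release()`, and `has_free_slot()` are named in this document, and everything else (the constructor, the `forward_*` helpers) is hypothetical.

```python
import asyncio


class SlotTracker:
    """Stand-in tracker, not the project's implementation."""

    def __init__(self, total: int) -> None:
        self._free = total
        self._cond = asyncio.Condition()

    def has_free_slot(self) -> bool:
        return self._free > 0

    async def acquire(self) -> None:
        async with self._cond:
            await self._cond.wait_for(lambda: self._free > 0)
            self._free -= 1

    async def release(self) -> None:
        async with self._cond:
            self._free += 1
            self._cond.notify()


async def forward_awaited(tracker: SlotTracker) -> bool:
    try:
        pass  # the outbound request would happen here
    finally:
        await tracker.release()  # the slot is free before anything else runs
    return tracker.has_free_slot()


async def forward_spawned(tracker: SlotTracker) -> bool:
    try:
        pass  # the outbound request would happen here
    finally:
        # Anti-pattern: create_task() only schedules release(); it has not run yet.
        task = asyncio.create_task(tracker.release())
    seen_by_scheduler = tracker.has_free_slot()  # still reports no free slot
    await task  # tidy up so the demo leaves no pending task
    return seen_by_scheduler
```

With a single-slot tracker that has been acquired, `forward_awaited` observes a free slot afterwards, while `forward_spawned` observes none at the moment a scheduler would check.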

QueueEntry.future is not created automatically. The future field defaults to None, so callers must create it explicitly: entry.future = asyncio.get_running_loop().create_future(). Never use asyncio.get_event_loop(); the asyncio documentation recommends get_running_loop() inside coroutines and callbacks.

build_router() creates a new APIRouter instance each call. The router must not be a module-level singleton; doing so causes route handlers to share state across test cases.

ProxyStats is per-app. Created in create_app() and passed into build_router(). Never use module-level counters or timestamps in monitor.py.

scheduler.start() is def, not async def. It creates a task; it does not need to be awaited.
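One way this shape can look, as a sketch under assumptions: `start()` schedules a background task and returns at once, and an `asyncio.Event` wakes the drain loop. The attribute names, the `drains` counter, `stop()`, and the synchronous `notify_slot_released()` are illustrative guesses, not the project's code.

```python
import asyncio


class Scheduler:
    """Hypothetical skeleton of a task-based scheduler."""

    def __init__(self) -> None:
        self._wakeup = asyncio.Event()
        self._task: asyncio.Task[None] | None = None
        self.drains = 0  # illustration only: counts wake-ups

    def start(self) -> None:
        # Plain def: create_task() schedules _run() and returns immediately.
        # It must be called while the event loop is running.
        self._task = asyncio.get_running_loop().create_task(self._run())

    def notify_slot_released(self) -> None:
        self._wakeup.set()

    async def _run(self) -> None:
        while True:
            await self._wakeup.wait()
            self._wakeup.clear()
            self.drains += 1  # a real scheduler would drain the queue here

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
```

A caller writes `scheduler.start()` with no `await`; in this sketch only `stop()` needs awaiting.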

asyncio.shield in _dispatch. The queue entry future is shielded so that a client timeout doesn’t cancel the backend dispatch. On timeout, the entry is removed from the queue and the future is cancelled manually.
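The shield-then-clean-up pattern can be sketched as below. The `dispatch` signature, the list-based queue, and the single-field `QueueEntry` are assumptions for illustration; the real `_dispatch` is not shown in this document.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class QueueEntry:
    future: asyncio.Future  # resolved by the scheduler with the chosen backend


async def dispatch(entry: QueueEntry, queue: list[QueueEntry], timeout: float) -> str:
    try:
        # shield() means the timeout cancels only the wrapper, not entry.future.
        return await asyncio.wait_for(asyncio.shield(entry.future), timeout)
    except asyncio.TimeoutError:
        # Explicit cleanup: drop the entry, then cancel the future ourselves.
        if entry in queue:
            queue.remove(entry)
        entry.future.cancel()
        raise
```

Because cancellation happens only in the `except` branch, the queue entry and its future are always retired together.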

Testing

Tests live in tests/ and use unittest (not pytest-native style), with IsolatedAsyncioTestCase for async tests.
The pytest configuration table ([tool.pytest.ini_options]) sets asyncio_mode = "auto" so async test methods run without additional decorators.
Integration tests (test_integration.py) use with TestClient(app) as client: — the context manager form is required to run the FastAPI lifespan (which starts the scheduler and opens the aiohttp session).
Test helpers: _entry(**kwargs) creates a QueueEntry with a future attached via asyncio.get_running_loop().create_future().
All 113 tests must pass: python -m pytest should exit 0.

Style

Python 3.13+; use from __future__ import annotations in all modules.
No comments unless the WHY is non-obvious (hidden constraint, asyncio race, workaround).
No docstrings on internal helpers.
Type annotations on all public functions and dataclass fields.
asyncio.get_running_loop() everywhere; never asyncio.get_event_loop().
