llamacpp-ha — Agent instructions
Project overview
llamacpp-ha is a slot-aware load balancer for llama.cpp
servers. It exposes a single OpenAI-compatible HTTP API and distributes
inference requests across multiple backends using a global request
queue, per-backend slot tracking, and session affinity. The codebase is
pure Python 3.13 async with FastAPI + aiohttp.
Commands
# Install (editable, with test deps)
pip install -e ".[test]"
# Run all tests
python -m pytest
# Run a single test file
python -m pytest tests/test_scheduler.py -v
# Start the proxy
llamacpp-ha --config config.json
Architecture
All source lives in src/llamacpp_ha/. Module
responsibilities:
| Module | Responsibility |
|---|---|
| config.py | BackendConfig + ProxyConfig (pydantic-settings, LLAMACPP_HA_ env prefix) |
| registry.py | Polls /health, /v1/models, /slots on each backend; maintains live/dead state |
| slot_tracker.py | Per-backend asyncio.Condition; acquire() blocks until a slot is free |
| queue.py | FIFO asyncio.Future queue; asyncio.Event wakeup for the scheduler |
| scheduler.py | Drains the queue, resolves futures with a chosen BackendState; notify_slot_released() triggers re-drain |
| policies.py | RoundRobinPolicy: per-model atomic counter over BackendCandidate list |
| session_store.py | SHA-256 prefix hash → preferred backend; TTL eviction |
| forwarder.py | aiohttp outbound request; streaming SSE via iter_chunked; releases slot in finally |
| middleware.py | ApiKeyMiddleware: Bearer token validation; /monitor and /monitor/data are exempt |
| monitor.py | ProxyStats dataclass + build_router() for /monitor (HTML) and /monitor/data (JSON) |
| proxy.py | create_app(): wires all components; registers FastAPI routes; lifespan manages tasks |
| __main__.py | CLI entry point (llamacpp-ha script) |
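As an illustration of the session_store.py row, here is a minimal sketch of session affinity keyed by a SHA-256 prefix hash with TTL eviction. The table does not say what is hashed or how eviction runs, so the class shape, the prompt-prefix input, the prefix_chars parameter, and the lazy eviction on lookup are all assumptions, not the project's actual implementation.

```python
from __future__ import annotations

import hashlib
import time


class SessionStore:
    # Hypothetical sketch; the real session_store.py may differ.
    def __init__(self, ttl_seconds: float = 300.0, prefix_chars: int = 256) -> None:
        self._ttl = ttl_seconds
        self._prefix_chars = prefix_chars
        # key -> (backend name, expiry time on the monotonic clock)
        self._entries: dict[str, tuple[str, float]] = {}

    def key_for(self, prompt: str) -> str:
        # Assumption: only the leading characters of the prompt are hashed,
        # so follow-up requests that share a prefix map to the same key.
        prefix = prompt[: self._prefix_chars]
        return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

    def remember(self, prompt: str, backend: str, now: float | None = None) -> None:
        now = time.monotonic() if now is None else now
        self._entries[self.key_for(prompt)] = (backend, now + self._ttl)

    def preferred(self, prompt: str, now: float | None = None) -> str | None:
        now = time.monotonic() if now is None else now
        key = self.key_for(prompt)
        hit = self._entries.get(key)
        if hit is None:
            return None
        backend, expires_at = hit
        if now >= expires_at:
            # Lazy TTL eviction: expired entries are dropped when looked up.
            del self._entries[key]
            return None
        return backend
```

The optional now argument exists only so the sketch can be exercised deterministically in tests without sleeping.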
Critical design invariants
Slot release must be awaited, not task-spawned.
SlotTracker.release() is async and must be
awaited directly. Scheduling it as a task causes a race
where the scheduler checks has_free_slot() before the
release runs. This applies in forwarder.py (both the
streaming finally block and the error path).
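The race can be reproduced with a toy tracker. This is a minimal sketch, assuming a counter guarded by an asyncio.Condition; the SlotTracker body and the demo_release_race helper are hypothetical and only mirror the method names used above.

```python
from __future__ import annotations

import asyncio


class SlotTracker:
    # Hypothetical minimal tracker; the real class in slot_tracker.py may differ.
    def __init__(self, total: int) -> None:
        self._free = total
        self._cond = asyncio.Condition()

    def has_free_slot(self) -> bool:
        return self._free > 0

    async def acquire(self) -> None:
        async with self._cond:
            await self._cond.wait_for(self.has_free_slot)
            self._free -= 1

    async def release(self) -> None:
        async with self._cond:
            self._free += 1
            self._cond.notify()


async def demo_release_race() -> tuple[bool, bool]:
    tracker = SlotTracker(total=1)

    # Task-spawned release: the task has not run yet when we check.
    await tracker.acquire()
    task = asyncio.create_task(tracker.release())
    seen_after_spawn = tracker.has_free_slot()
    await task

    # Awaited release: the slot is free by the time we check.
    await tracker.acquire()
    await tracker.release()
    seen_after_await = tracker.has_free_slot()

    return seen_after_spawn, seen_after_await


if __name__ == "__main__":
    print(asyncio.run(demo_release_race()))
```

In the task-spawned case the check runs before the event loop has had a chance to execute release(), which is the window in which a scheduler would wrongly conclude that no slot is free.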
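QueueEntry.future is never created automatically. The
future field defaults to None, so callers must
create it explicitly:
entry.future = asyncio.get_running_loop().create_future().
Never use asyncio.get_event_loop(): get_running_loop() is the
preferred call inside coroutines and callbacks, and the fallback
behaviour of get_event_loop() outside a running loop is deprecated.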
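A minimal sketch of that pattern follows. Only the future field and its None default come from the text; the model field and the make_entry helper are hypothetical.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass
class QueueEntry:
    # Hypothetical shape: only `future` and its None default come from the text.
    model: str
    future: asyncio.Future[str] | None = None


async def make_entry(model: str) -> QueueEntry:
    entry = QueueEntry(model=model)
    # get_running_loop() raises RuntimeError if no loop is running, so the
    # future is always bound to the loop that will later resolve it.
    entry.future = asyncio.get_running_loop().create_future()
    return entry
```

Because get_running_loop() fails loudly outside a coroutine or callback, a misplaced call shows up as an immediate RuntimeError rather than a future attached to the wrong loop.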
build_router() creates a new
APIRouter instance each call. The router must not
be a module-level singleton; doing so causes route handlers to share
state across test cases.
ProxyStats is per-app. Created in
create_app() and passed into build_router().
Never use module-level counters or timestamps in
monitor.py.
scheduler.start() is def, not
async def. It creates a task; it does not need to
be awaited.
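A skeleton of that shape, as a sketch only: the Scheduler body, the drains counter, the stop() method, and the synchronous notify_slot_released() are assumptions made for the demo, not the project's code.

```python
from __future__ import annotations

import asyncio


class Scheduler:
    # Hypothetical skeleton; only the shape of start() follows the text.
    def __init__(self) -> None:
        self._wakeup = asyncio.Event()
        self._task: asyncio.Task[None] | None = None
        self.drains = 0

    def start(self) -> None:
        # Plain def: creating a task needs a running loop but nothing to await.
        self._task = asyncio.get_running_loop().create_task(self._run())

    def notify_slot_released(self) -> None:
        self._wakeup.set()

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            # Collect the CancelledError instead of letting it propagate.
            await asyncio.gather(self._task, return_exceptions=True)

    async def _run(self) -> None:
        while True:
            await self._wakeup.wait()
            self._wakeup.clear()
            self.drains += 1  # stand-in for draining the queue


async def demo_start() -> int:
    scheduler = Scheduler()
    scheduler.start()  # no await needed
    scheduler.notify_slot_released()
    await asyncio.sleep(0)  # yield once so the scheduler task can run
    await scheduler.stop()
    return scheduler.drains
```

start() still has to be called while a loop is running (for example from the lifespan), since get_running_loop() raises RuntimeError otherwise.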
asyncio.shield in
_dispatch. The queue entry future is shielded so
that a client timeout doesn’t cancel the backend dispatch. On timeout,
the entry is removed from the queue and the future is cancelled
manually.
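The shield-plus-manual-cleanup pattern can be sketched as below. The wait_for_backend function, the list used as a queue, and the QueueEntry shape are hypothetical stand-ins for _dispatch and the real queue.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass(eq=False)  # identity comparison, so list.remove() finds this exact entry
class QueueEntry:
    future: asyncio.Future[str]


async def wait_for_backend(
    entry: QueueEntry, pending: list[QueueEntry], timeout: float
) -> str:
    try:
        # shield(): the timeout cancels only the outer wrapper, not entry.future.
        return await asyncio.wait_for(asyncio.shield(entry.future), timeout)
    except asyncio.TimeoutError:
        # Cleanup is explicit: drop the entry, then cancel the future ourselves.
        if entry in pending:
            pending.remove(entry)
        entry.future.cancel()
        raise


async def demo_timeout() -> tuple[bool, int]:
    loop = asyncio.get_running_loop()
    entry = QueueEntry(future=loop.create_future())
    pending = [entry]
    try:
        await wait_for_backend(entry, pending, timeout=0.01)
    except asyncio.TimeoutError:
        pass
    return entry.future.cancelled(), len(pending)
```

Without the shield, wait_for would cancel entry.future itself as soon as the timeout fired; with it, cancellation happens only at the point where the code has also removed the entry from the queue.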
Testing
- Tests are in tests/ using unittest (not pytest-native style), with IsolatedAsyncioTestCase for async tests.
- pytest.ini_options sets asyncio_mode = "auto" so async test methods run without additional decorators.
- Integration tests (test_integration.py) use with TestClient(app) as client: the context manager form is required to run the FastAPI lifespan (which starts the scheduler and opens the aiohttp session).
- Test helpers: _entry(**kwargs) creates a QueueEntry with a future attached via asyncio.get_running_loop().create_future().
- All 113 tests must pass: python -m pytest should exit 0.
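A test in this style might look like the following sketch. The QueueEntry stand-in and the test class are hypothetical; only the _entry helper's behaviour and the use of IsolatedAsyncioTestCase follow the notes above.

```python
from __future__ import annotations

import asyncio
import unittest
from dataclasses import dataclass


@dataclass
class QueueEntry:
    # Hypothetical stand-in for the real QueueEntry.
    model: str = "test-model"
    future: asyncio.Future[str] | None = None


def _entry(**kwargs: str) -> QueueEntry:
    entry = QueueEntry(**kwargs)
    # Works because IsolatedAsyncioTestCase runs each test inside a loop.
    entry.future = asyncio.get_running_loop().create_future()
    return entry


class EntryHelperTest(unittest.IsolatedAsyncioTestCase):
    async def test_entry_has_pending_future(self) -> None:
        entry = _entry(model="m")
        self.assertIsNotNone(entry.future)
        self.assertFalse(entry.future.done())
        entry.future.set_result("backend-a")
        self.assertEqual(await entry.future, "backend-a")


if __name__ == "__main__":
    unittest.main()
```

Calling _entry() from a synchronous test method would raise RuntimeError, since no loop is running there.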
Style
- Python 3.13+; use from __future__ import annotations in all modules.
- No comments unless the WHY is non-obvious (hidden constraint, asyncio race, workaround).
- No docstrings on internal helpers.
- Type annotations on all public functions and dataclass fields.
- asyncio.get_running_loop() everywhere; never asyncio.get_event_loop().