# llamacpp-ha: Agent instructions

## Project overview

`llamacpp-ha` is a slot-aware load balancer for llama.cpp servers. It exposes a single OpenAI-compatible HTTP API and distributes inference requests across multiple backends using a global request queue, per-backend slot tracking, and session affinity.

The codebase is pure Python 3.13 async with FastAPI + aiohttp.

## Commands

```bash
# Install (editable, with test deps)
pip install -e ".[test]"

# Run all tests
python -m pytest

# Run a single test file
python -m pytest tests/test_scheduler.py -v

# Start the proxy
llamacpp-ha --config config.json
```

## Architecture

All source lives in `src/llamacpp_ha/`. Module responsibilities:

| Module | Responsibility |
|---|---|
| `config.py` | `BackendConfig` + `ProxyConfig` (pydantic-settings, `LLAMACPP_HA_` env prefix) |
| `registry.py` | Polls `/health`, `/v1/models`, `/slots` on each backend; maintains live/dead state |
| `slot_tracker.py` | Per-backend `asyncio.Condition`; `acquire()` blocks until a slot is free |
| `queue.py` | FIFO `asyncio.Future` queue; `asyncio.Event` wakeup for the scheduler |
| `scheduler.py` | Drains the queue, resolves futures with a chosen `BackendState`; `notify_slot_released()` triggers re-drain |
| `policies.py` | `RoundRobinPolicy`: per-model atomic counter over `BackendCandidate` list |
| `session_store.py` | SHA-256 prefix hash → preferred backend; TTL eviction |
| `forwarder.py` | aiohttp outbound request; streaming SSE via `iter_chunked`; releases slot in `finally` |
| `middleware.py` | `ApiKeyMiddleware`: `Bearer` token validation; `/monitor` and `/monitor/data` are exempt |
| `monitor.py` | `ProxyStats` dataclass + `build_router()` for `/monitor` (HTML) and `/monitor/data` (JSON) |
| `proxy.py` | `create_app()`: wires all components; registers FastAPI routes; lifespan manages tasks |
| `__main__.py` | CLI entry point (`llamacpp-ha` script) |

## Critical design invariants

**Slot release must be awaited, not task-spawned.**
`SlotTracker.release()` is `async` and must be `await`ed directly. Scheduling it as a task causes a race where the scheduler checks `has_free_slot()` before the release runs. This applies in `forwarder.py` (both the streaming `finally` block and the error path).

**`QueueEntry.future` is not created automatically.**
The `future` field defaults to `None`. Callers must create it explicitly: `entry.future = asyncio.get_running_loop().create_future()`. Never use `asyncio.get_event_loop()`: it is deprecated when there is no current event loop, and `get_running_loop()` is the preferred call inside coroutines.

**`build_router()` creates a new `APIRouter` instance each call.**
The router must not be a module-level singleton; doing so causes route handlers to share state across test cases.

**`ProxyStats` is per-app.**
Created in `create_app()` and passed into `build_router()`. Never use module-level counters or timestamps in `monitor.py`.

**`scheduler.start()` is `def`, not `async def`.**
It creates a task; it does not need to be awaited.

**`asyncio.shield` in `_dispatch`.**
The queue entry future is shielded so that a client timeout doesn't cancel the backend dispatch. On timeout, the entry is removed from the queue and the future is cancelled manually.

## Testing

- Tests are in `tests/` using `unittest` (not pytest-native style), with `IsolatedAsyncioTestCase` for async tests.
- `pytest.ini_options` sets `asyncio_mode = "auto"` so async test methods run without additional decorators.
- Integration tests (`test_integration.py`) use `with TestClient(app) as client:`; the context manager form is required to run the FastAPI lifespan (which starts the scheduler and opens the aiohttp session).
- Test helpers: `_entry(**kwargs)` creates a `QueueEntry` with a future attached via `asyncio.get_running_loop().create_future()`.
- All 113 tests must pass: `python -m pytest` should exit 0.

## Style

- Python 3.13+; use `from __future__ import annotations` in all modules.
- No comments unless the WHY is non-obvious (hidden constraint, asyncio race, workaround).
- No docstrings on internal helpers.
- Type annotations on all public functions and dataclass fields.
- `asyncio.get_running_loop()` everywhere; never `asyncio.get_event_loop()`.
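
The slot-release invariant listed under "Critical design invariants" can be illustrated with a minimal sketch. The `SlotTracker` below is a hypothetical stand-in built on `asyncio.Condition`, not the code in `slot_tracker.py`; only the method names `acquire()`, `release()`, and `has_free_slot()` come from this document, and the constructor argument and `demo()` helper are assumptions made for the example.

```python
import asyncio


class SlotTracker:
    # Illustrative stand-in; the real class in slot_tracker.py may differ.
    def __init__(self, total_slots: int) -> None:
        self._free = total_slots
        self._cond = asyncio.Condition()

    def has_free_slot(self) -> bool:
        return self._free > 0

    async def acquire(self) -> None:
        async with self._cond:
            # Block until some slot has been released.
            await self._cond.wait_for(lambda: self._free > 0)
            self._free -= 1

    async def release(self) -> None:
        async with self._cond:
            self._free += 1
            self._cond.notify()


async def demo() -> tuple[bool, bool]:
    # Awaited release: the slot counts as free as soon as the await returns.
    awaited = SlotTracker(total_slots=1)
    await awaited.acquire()
    await awaited.release()
    free_after_await = awaited.has_free_slot()

    # Task-spawned release: the task has not run yet when the check happens,
    # which is the race a scheduler would hit.
    spawned = SlotTracker(total_slots=1)
    await spawned.acquire()
    task = asyncio.create_task(spawned.release())
    free_after_spawn = spawned.has_free_slot()
    await task  # finish the release so no task is left pending
    return free_after_await, free_after_spawn


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (True, False)
```

In this sketch the second check returns `False` because `create_task()` only schedules the coroutine; nothing runs until the caller yields to the event loop.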