
# llamacpp-ha — Agent instructions

## Project overview

`llamacpp-ha` is a slot-aware load balancer for llama.cpp servers. It exposes a single OpenAI-compatible HTTP API and distributes inference requests across multiple backends using a global request queue, per-backend slot tracking, and session affinity. The codebase is asynchronous Python 3.13, using FastAPI for the inbound API and aiohttp for outbound requests to the backends.

## Commands

```bash
# Install (editable, with test deps)
pip install -e ".[test]"

# Run all tests
python -m pytest

# Run a single test file
python -m pytest tests/test_scheduler.py -v

# Start the proxy
llamacpp-ha --config config.json
```

## Architecture

All source lives in `src/llamacpp_ha/`. Module responsibilities:

| Module | Responsibility |
|---|---|
| `config.py` | `BackendConfig` + `ProxyConfig` (pydantic-settings, `LLAMACPP_HA_` env prefix) |
| `registry.py` | Polls `/health`, `/v1/models`, `/slots` on each backend; maintains live/dead state |
| `slot_tracker.py` | Per-backend `asyncio.Condition`; `acquire()` blocks until a slot is free |
| `queue.py` | FIFO `asyncio.Future` queue; `asyncio.Event` wakeup for the scheduler |
| `scheduler.py` | Drains the queue, resolves futures with a chosen `BackendState`; `notify_slot_released()` triggers re-drain |
| `policies.py` | `RoundRobinPolicy` — per-model atomic counter over `BackendCandidate` list |
| `session_store.py` | SHA-256 prefix hash → preferred backend; TTL eviction |
| `forwarder.py` | aiohttp outbound request; streaming SSE via `iter_chunked`; releases slot in `finally` |
| `middleware.py` | `ApiKeyMiddleware` — `Bearer` token validation; `/monitor` and `/monitor/data` are exempt |
| `monitor.py` | `ProxyStats` dataclass + `build_router()` for `/monitor` (HTML) and `/monitor/data` (JSON) |
| `proxy.py` | `create_app()` — wires all components; registers FastAPI routes; lifespan manages tasks |
| `__main__.py` | CLI entry point (`llamacpp-ha` script) |
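
As an illustration of the `session_store.py` row, here is a minimal sketch of a prefix-hash store with TTL eviction. The class name `SessionStoreSketch`, the `prefix_chars` and `ttl_seconds` parameters, the choice to hash the leading characters of the prompt, and evicting lazily on lookup are all assumptions made for this example; the real module may differ.

```python
from __future__ import annotations

import hashlib
import time


class SessionStoreSketch:
    """Hypothetical stand-in for session_store.py: prefix hash -> backend, with TTL."""

    def __init__(self, ttl_seconds: float = 300.0, prefix_chars: int = 256) -> None:
        self._ttl = ttl_seconds
        self._prefix_chars = prefix_chars
        # key -> (backend name, expiry timestamp on the monotonic clock)
        self._entries: dict[str, tuple[str, float]] = {}

    def key_for(self, prompt: str) -> str:
        # Hash only the leading characters so requests that share the same
        # conversation prefix map to the same key.
        prefix = prompt[: self._prefix_chars].encode("utf-8")
        return hashlib.sha256(prefix).hexdigest()

    def remember(self, prompt: str, backend: str, now: float | None = None) -> None:
        now = time.monotonic() if now is None else now
        self._entries[self.key_for(prompt)] = (backend, now + self._ttl)

    def preferred_backend(self, prompt: str, now: float | None = None) -> str | None:
        now = time.monotonic() if now is None else now
        key = self.key_for(prompt)
        entry = self._entries.get(key)
        if entry is None:
            return None
        backend, expires_at = entry
        if now >= expires_at:
            # Lazy TTL eviction: drop the stale mapping when it is looked up.
            del self._entries[key]
            return None
        return backend
```

The `now` parameter exists only so the sketch can be exercised with fixed timestamps in tests instead of sleeping.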

## Critical design invariants

**Slot release must be awaited, not task-spawned.** `SlotTracker.release()` is `async` and must be `await`ed directly. Scheduling it as a task causes a race where the scheduler checks `has_free_slot()` before the release runs. This applies in `forwarder.py` (both the streaming `finally` block and the error path).
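
The race can be reproduced with a minimal sketch. `SlotTrackerSketch` is a hypothetical single-backend tracker written for this example (only `acquire()`, `release()`, and `has_free_slot()` are named in this document; the constructor and internals are assumptions), and the two coroutines stand in for the forwarder's cleanup path.

```python
from __future__ import annotations

import asyncio


class SlotTrackerSketch:
    """Hypothetical single-backend tracker guarded by an asyncio.Condition."""

    def __init__(self, total_slots: int) -> None:
        self._free = total_slots
        self._cond = asyncio.Condition()

    def has_free_slot(self) -> bool:
        return self._free > 0

    async def acquire(self) -> None:
        async with self._cond:
            # Block until another task releases a slot.
            await self._cond.wait_for(self.has_free_slot)
            self._free -= 1

    async def release(self) -> None:
        async with self._cond:
            self._free += 1
            self._cond.notify(1)


async def release_awaited() -> bool:
    tracker = SlotTrackerSketch(total_slots=1)
    await tracker.acquire()
    try:
        pass  # the request would be forwarded here
    finally:
        await tracker.release()
    # The slot is already free by the time the scheduler is notified.
    return tracker.has_free_slot()


async def release_spawned() -> bool:
    tracker = SlotTrackerSketch(total_slots=1)
    await tracker.acquire()
    task = asyncio.create_task(tracker.release())
    # The task has not run yet, so a scheduler check here sees no free slot.
    seen_by_scheduler = tracker.has_free_slot()
    await task
    return seen_by_scheduler
```

With the default task factory, `release_awaited()` returns `True` and `release_spawned()` returns `False`, because a spawned task only starts running at the next suspension point, after the check has already happened.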

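**`QueueEntry.future` is not created automatically.** The `future` field defaults to `None`, so callers must create it explicitly: `entry.future = asyncio.get_running_loop().create_future()`. Never use `asyncio.get_event_loop()`: the asyncio documentation recommends `get_running_loop()` in coroutines and callbacks, and it fails fast with `RuntimeError` when no loop is running.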

**`build_router()` creates a new `APIRouter` instance each call.** The router must not be a module-level singleton; doing so causes route handlers to share state across test cases.

**`ProxyStats` is per-app.** Created in `create_app()` and passed into `build_router()`. Never use module-level counters or timestamps in `monitor.py`.

**`scheduler.start()` is `def`, not `async def`.** It creates the scheduler task and returns immediately, so call it without `await`.

**`asyncio.shield` in `_dispatch`.** The queue entry future is shielded so that a client timeout doesn't cancel the backend dispatch. On timeout, the entry is removed from the queue and the future is cancelled manually.
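
A minimal sketch of that pattern follows. The function `wait_for_backend`, the use of `asyncio.wait_for` for the client timeout, and modelling the queue as a plain list of futures are assumptions for illustration; the real `_dispatch` may be structured differently.

```python
from __future__ import annotations

import asyncio


async def wait_for_backend(
    future: asyncio.Future[str],
    queue: list[asyncio.Future[str]],
    timeout: float,
) -> str | None:
    try:
        # shield() keeps the timeout from cancelling `future` itself;
        # only the outer wrapper created by shield() is cancelled.
        return await asyncio.wait_for(asyncio.shield(future), timeout)
    except asyncio.TimeoutError:
        # Cleanup is explicit: dequeue the entry, then cancel its future.
        if future in queue:
            queue.remove(future)
        future.cancel()
        return None


async def demo() -> tuple[str | None, bool, int]:
    loop = asyncio.get_running_loop()
    future: asyncio.Future[str] = loop.create_future()
    queue = [future]
    # Nothing resolves the future, so the short timeout fires.
    result = await wait_for_backend(future, queue, timeout=0.01)
    return result, future.cancelled(), len(queue)
```

Without the shield, `wait_for` would cancel the future as soon as the timeout fired, before the cleanup code had a chance to dequeue the entry in a controlled order.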

## Testing

- Tests are in `tests/` using `unittest` (not pytest-native style), with `IsolatedAsyncioTestCase` for async tests.
- `[tool.pytest.ini_options]` in `pyproject.toml` sets `asyncio_mode = "auto"` so async tests run without `@pytest.mark.asyncio` decorators.
- Integration tests (`test_integration.py`) use `with TestClient(app) as client:` — the context manager form is required to run the FastAPI lifespan (which starts the scheduler and opens the aiohttp session).
- Test helpers: `_entry(**kwargs)` creates a `QueueEntry` with a future attached via `asyncio.get_running_loop().create_future()`.
- All 113 tests must pass: `python -m pytest` should exit 0.
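
The test style and the `_entry` helper described in this list might look roughly like the sketch below. `QueueEntrySketch` and its `model` field are hypothetical stand-ins for the real `QueueEntry`, and the test class and method names are invented for this example.

```python
from __future__ import annotations

import asyncio
import unittest
from dataclasses import dataclass


@dataclass
class QueueEntrySketch:
    """Hypothetical stand-in for the real QueueEntry."""

    model: str = "example-model"
    future: asyncio.Future[str] | None = None


def _entry(**kwargs: object) -> QueueEntrySketch:
    # Must be called from inside a running loop, i.e. from an async test method.
    entry = QueueEntrySketch(**kwargs)
    entry.future = asyncio.get_running_loop().create_future()
    return entry


class EntryHelperTest(unittest.IsolatedAsyncioTestCase):
    async def test_entry_has_pending_future(self) -> None:
        entry = _entry(model="m1")
        self.assertEqual(entry.model, "m1")
        self.assertFalse(entry.future.done())

    async def test_future_resolves(self) -> None:
        entry = _entry()
        # Stand-in for the scheduler resolving the entry with a backend.
        entry.future.set_result("backend-a")
        self.assertEqual(await entry.future, "backend-a")
```

Because `IsolatedAsyncioTestCase` runs each test method on its own event loop, futures created by `_entry` always belong to the loop that awaits them.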

## Style

- Python 3.13+; use `from __future__ import annotations` in all modules.
- No comments unless the WHY is non-obvious (hidden constraint, asyncio race, workaround).
- No docstrings on internal helpers.
- Type annotations on all public functions and dataclass fields.
- `asyncio.get_running_loop()` everywhere; never `asyncio.get_event_loop()`.