# llamacpp-ha — Agent instructions

## Project overview

`llamacpp-ha` is a slot-aware load balancer for llama.cpp servers. It exposes a single OpenAI-compatible HTTP API and distributes inference requests across multiple backends using a global request queue, per-backend slot tracking, and session affinity. The codebase is pure async Python 3.13, with FastAPI serving inbound requests and aiohttp making outbound ones.

## Commands

```bash
# Install (editable, with test deps)
pip install -e ".[test]"

# Run all tests
python -m pytest

# Run a single test file
python -m pytest tests/test_scheduler.py -v

# Start the proxy
llamacpp-ha --config config.json
```

## Architecture

All source lives in `src/llamacpp_ha/`. Module responsibilities:

| Module | Responsibility |
|---|---|
| `config.py` | `BackendConfig` + `ProxyConfig` (pydantic-settings, `LLAMACPP_HA_` env prefix) |
| `registry.py` | Polls `/health`, `/v1/models`, `/slots` on each backend; maintains live/dead state |
| `slot_tracker.py` | Per-backend `asyncio.Condition`; `acquire()` blocks until a slot is free |
| `queue.py` | FIFO `asyncio.Future` queue; `asyncio.Event` wakeup for the scheduler |
| `scheduler.py` | Drains the queue, resolves futures with a chosen `BackendState`; `notify_slot_released()` triggers re-drain |
| `policies.py` | `RoundRobinPolicy`: per-model atomic counter over `BackendCandidate` list |
| `session_store.py` | SHA-256 prefix hash → preferred backend; TTL eviction |
| `forwarder.py` | aiohttp outbound request; streaming SSE via `iter_chunked`; releases slot in `finally` |
| `middleware.py` | `ApiKeyMiddleware`: `Bearer` token validation; `/monitor` and `/monitor/data` are exempt |
| `monitor.py` | `ProxyStats` dataclass + `build_router()` for `/monitor` (HTML) and `/monitor/data` (JSON) |
| `proxy.py` | `create_app()`: wires all components; registers FastAPI routes; lifespan manages tasks |
| `__main__.py` | CLI entry point (`llamacpp-ha` script) |

## Critical design invariants

**Slot release must be awaited, not task-spawned.** `SlotTracker.release()` is `async` and must be `await`ed directly. Scheduling it as a task (for example with `asyncio.create_task`) opens a race: the scheduler can check `has_free_slot()` before the release has run. This applies in `forwarder.py` (both the streaming `finally` block and the error path).
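The race can be reproduced with a small stand-in. The sketch below assumes a tracker built on `asyncio.Condition` with the method names mentioned in this document (`acquire()`, `release()`, `has_free_slot()`); the internals and the `forward*` helpers are hypothetical.

```python
import asyncio


class SlotTracker:
    """Simplified stand-in: counts free slots behind an asyncio.Condition."""

    def __init__(self, total_slots: int) -> None:
        self._free = total_slots
        self._cond = asyncio.Condition()

    def has_free_slot(self) -> bool:
        return self._free > 0

    async def acquire(self) -> None:
        async with self._cond:
            await self._cond.wait_for(lambda: self._free > 0)
            self._free -= 1

    async def release(self) -> None:
        async with self._cond:
            self._free += 1
            self._cond.notify(1)


async def forward(tracker: SlotTracker) -> None:
    await tracker.acquire()
    try:
        await asyncio.sleep(0)  # stand-in for the outbound request
    finally:
        await tracker.release()  # awaited: the slot is free when we return


async def forward_racy(tracker: SlotTracker) -> asyncio.Task:
    await tracker.acquire()
    try:
        await asyncio.sleep(0)
    finally:
        # Anti-pattern: release runs only once the loop gets to this task.
        task = asyncio.create_task(tracker.release())
    return task


async def demo() -> tuple[bool, bool]:
    good = SlotTracker(total_slots=1)
    await forward(good)
    free_after_await = good.has_free_slot()

    racy = SlotTracker(total_slots=1)
    pending = await forward_racy(racy)
    free_after_spawn = racy.has_free_slot()  # checked before the task has run
    await pending
    return free_after_await, free_after_spawn
```

With the awaited release the slot is visible as free immediately; with the spawned task, a check made right after the forwarder returns still sees the slot as busy.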

**`QueueEntry.future` is not created automatically.** The `future` field defaults to `None`, so callers must create it explicitly: `entry.future = asyncio.get_running_loop().create_future()`. Never use `asyncio.get_event_loop()`; `get_running_loop()` is the preferred call inside coroutines, and the implicit loop creation that `get_event_loop()` relied on is deprecated.
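As a sketch of that pattern, assuming a dataclass with a hypothetical `model` field alongside `future` (only `future` is described here):

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueEntry:
    # Hypothetical shape; only the `future` field comes from this document.
    model: str
    future: Optional[asyncio.Future] = None


async def make_entry(model: str) -> QueueEntry:
    entry = QueueEntry(model=model)
    # Runs inside a coroutine, so the future is bound to the running loop.
    entry.future = asyncio.get_running_loop().create_future()
    return entry


async def demo() -> str:
    entry = await make_entry("example-model")
    assert entry.future is not None
    entry.future.set_result("backend-a")  # roughly what a scheduler does on dispatch
    return await entry.future
```

Keeping the default as `None` means an entry can be constructed anywhere, while the future itself is only ever created where a running loop is guaranteed to exist.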

**`build_router()` creates a new `APIRouter` instance on each call.** The router must not be a module-level singleton; doing so causes route handlers to share state across test cases.

**`ProxyStats` is per-app.** Created in `create_app()` and passed into `build_router()`. Never use module-level counters or timestamps in `monitor.py`.
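The factory shape behind these two rules can be illustrated without FastAPI. In the sketch below a plain `dict` of handlers stands in for `APIRouter`, and the `ProxyStats` fields are hypothetical; the point is only that each call builds a fresh route table closing over its own stats object.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProxyStats:
    # Hypothetical fields, for illustration only.
    requests_total: int = 0
    started_at: float = field(default_factory=time.monotonic)


def build_router(stats: ProxyStats) -> dict[str, Callable[[], dict]]:
    """Stand-in for a function that returns a new APIRouter on each call."""

    def monitor_data() -> dict:
        return {"requests_total": stats.requests_total}

    # New table per call; the handler sees only the stats it was given.
    return {"/monitor/data": monitor_data}


def create_app() -> tuple[ProxyStats, dict[str, Callable[[], dict]]]:
    stats = ProxyStats()  # one instance per app, never module-level
    return stats, build_router(stats)
```

Two apps built this way do not observe each other's counters, which is the property the test suite relies on when it creates a new app per test case.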

**`scheduler.start()` is `def`, not `async def`.** It creates a task and returns immediately, so call it without `await`.
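A rough sketch of a synchronous `start()` that spawns the drain loop is shown below. The class body, the `drains` counter, and the `stop()` method are assumptions for illustration; only `start()` and `notify_slot_released()` are names taken from this document.

```python
import asyncio
from typing import Optional


class Scheduler:
    def __init__(self) -> None:
        self._wakeup = asyncio.Event()
        self._task: Optional[asyncio.Task] = None
        self.drains = 0

    def start(self) -> None:
        # Plain `def`: create_task needs a running loop, but there is nothing to await.
        self._task = asyncio.create_task(self._run())

    def notify_slot_released(self) -> None:
        self._wakeup.set()

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass

    async def _run(self) -> None:
        while True:
            await self._wakeup.wait()
            self._wakeup.clear()
            self.drains += 1  # stand-in for draining the queue


async def demo() -> int:
    scheduler = Scheduler()
    scheduler.start()  # no await
    scheduler.notify_slot_released()
    await asyncio.sleep(0)  # let the task run one drain pass
    await asyncio.sleep(0)
    await scheduler.stop()
    return scheduler.drains
```

Because `start()` calls `asyncio.create_task`, it still has to be invoked from code running inside the event loop, such as a lifespan handler.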

**`asyncio.shield` in `_dispatch`.** The queue entry future is shielded so that a client timeout doesn't cancel the backend dispatch. On timeout, the entry is removed from the queue and the future is cancelled manually.
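The shield-then-cancel-manually pattern can be sketched as follows. The function name `wait_for_backend` and the string results are hypothetical, and the dequeue step is only noted in a comment since the queue API is not described here.

```python
import asyncio
from typing import Optional


async def wait_for_backend(future: asyncio.Future, timeout: float) -> Optional[str]:
    try:
        # shield() stops wait_for's timeout from cancelling `future` itself.
        return await asyncio.wait_for(asyncio.shield(future), timeout)
    except asyncio.TimeoutError:
        # Cleanup is explicit; real code would also remove the entry from the queue.
        future.cancel()
        return None


async def demo() -> tuple[Optional[str], bool, Optional[str]]:
    loop = asyncio.get_running_loop()

    never_resolved = loop.create_future()
    timed_out = await wait_for_backend(never_resolved, timeout=0.01)

    resolved = loop.create_future()
    loop.call_later(0.01, resolved.set_result, "backend-a")
    chosen = await wait_for_backend(resolved, timeout=1.0)
    return timed_out, never_resolved.cancelled(), chosen
```

Separating the timeout from the cancellation keeps the decision about when the future dies in one place, instead of leaving it to whichever wrapper happens to time out first.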

## Testing

- Tests live in `tests/` and use `unittest` (not pytest-native style), with `IsolatedAsyncioTestCase` for async tests.
- `pytest.ini_options` sets `asyncio_mode = "auto"` so async test methods run without additional decorators.
- Integration tests (`test_integration.py`) use `with TestClient(app) as client:`. The context manager form is required to run the FastAPI lifespan (which starts the scheduler and opens the aiohttp session).
- Test helpers: `_entry(**kwargs)` creates a `QueueEntry` with a future attached via `asyncio.get_running_loop().create_future()`.
- All 113 tests must pass: `python -m pytest` should exit 0.
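A test in this style might look like the sketch below. The `QueueEntry` here is a simplified stand-in with a hypothetical `model` field, and the test class is illustrative rather than taken from the suite.

```python
import asyncio
import unittest
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueEntry:
    # Simplified stand-in for the project's QueueEntry.
    model: str = "example-model"
    future: Optional[asyncio.Future] = None


def _entry(**kwargs) -> QueueEntry:
    # Must be called from inside a running loop, e.g. an async test method.
    entry = QueueEntry(**kwargs)
    entry.future = asyncio.get_running_loop().create_future()
    return entry


class EntryHelperTest(unittest.IsolatedAsyncioTestCase):
    async def test_future_resolves(self) -> None:
        entry = _entry(model="m1")
        entry.future.set_result("backend-a")
        self.assertEqual(await entry.future, "backend-a")

    async def test_future_starts_pending(self) -> None:
        self.assertFalse(_entry().future.done())
```

`IsolatedAsyncioTestCase` runs each async test method on a loop it manages itself, which is why the helper can rely on `asyncio.get_running_loop()`.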

## Style

- Python 3.13+; use `from __future__ import annotations` in all modules.
- No comments unless the WHY is non-obvious (hidden constraint, asyncio race, workaround).
- No docstrings on internal helpers.
- Type annotations on all public functions and dataclass fields.
- `asyncio.get_running_loop()` everywhere; never `asyncio.get_event_loop()`.