llamacpp-ha/AGENTS.md
2026-05-17 09:54:18 +02:00

llamacpp-ha — Agent instructions

Project overview

llamacpp-ha is a slot-aware load balancer for llama.cpp servers. It exposes a single OpenAI-compatible HTTP API and distributes inference requests across multiple backends using a global request queue, per-backend slot tracking, and session affinity. The codebase is pure Python 3.13 async with FastAPI + aiohttp.

Commands

# Install (editable, with test deps)
pip install -e ".[test]"

# Run all tests
python -m pytest

# Run a single test file
python -m pytest tests/test_scheduler.py -v

# Start the proxy
llamacpp-ha --config config.json

Architecture

All source lives in src/llamacpp_ha/. Module responsibilities:

| Module | Responsibility |
| --- | --- |
| config.py | BackendConfig + ProxyConfig (pydantic-settings, LLAMACPP_HA_ env prefix) |
| registry.py | Polls /health, /v1/models, /slots on each backend; maintains live/dead state |
| slot_tracker.py | Per-backend asyncio.Condition; acquire() blocks until a slot is free |
| queue.py | FIFO asyncio.Future queue; asyncio.Event wakeup for the scheduler |
| scheduler.py | Drains the queue, resolves futures with a chosen BackendState; notify_slot_released() triggers re-drain |
| policies.py | RoundRobinPolicy: per-model atomic counter over BackendCandidate list |
| session_store.py | SHA-256 prefix hash → preferred backend; TTL eviction |
| forwarder.py | aiohttp outbound request; streaming SSE via iter_chunked; releases slot in finally |
| middleware.py | ApiKeyMiddleware: Bearer token validation; /monitor and /monitor/data are exempt |
| monitor.py | ProxyStats dataclass + build_router() for /monitor (HTML) and /monitor/data (JSON) |
| proxy.py | create_app(): wires all components; registers FastAPI routes; lifespan manages tasks |
| __main__.py | CLI entry point (llamacpp-ha script) |

Critical design invariants

Slot release must be awaited, not task-spawned. SlotTracker.release() is async and must be awaited directly. Scheduling it as a task causes a race where the scheduler checks has_free_slot() before the release runs. This applies in forwarder.py (both the streaming finally block and the error path).
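
The race described above can be reproduced with a toy tracker. This is a sketch under stated assumptions: SlotTrackerSketch is a hypothetical stand-in built on asyncio.Condition, not the project's SlotTracker, and the two helper coroutines exist only for illustration.

```python
import asyncio


class SlotTrackerSketch:
    """Hypothetical stand-in for a per-backend slot counter."""

    def __init__(self, total: int) -> None:
        self._free = total
        self._cond = asyncio.Condition()

    def has_free_slot(self) -> bool:
        return self._free > 0

    async def acquire(self) -> None:
        async with self._cond:
            # Block until at least one slot is free, then take it.
            await self._cond.wait_for(lambda: self._free > 0)
            self._free -= 1

    async def release(self) -> None:
        async with self._cond:
            self._free += 1
            self._cond.notify()


async def release_awaited() -> bool:
    tracker = SlotTrackerSketch(total=1)
    await tracker.acquire()
    try:
        pass  # the forwarded request would run here
    finally:
        await tracker.release()  # the slot is free before control returns
    return tracker.has_free_slot()


async def release_spawned() -> bool:
    tracker = SlotTrackerSketch(total=1)
    await tracker.acquire()
    task = asyncio.create_task(tracker.release())  # anti-pattern
    seen = tracker.has_free_slot()  # the release task has not run yet
    await task
    return seen
```

In this sketch release_awaited() observes a free slot, while release_spawned() observes none, because the spawned task cannot run until the caller next yields to the event loop.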

QueueEntry.future is not created automatically. The future field defaults to None, so callers must create it explicitly: entry.future = asyncio.get_running_loop().create_future(). Never use asyncio.get_event_loop(); asyncio.get_running_loop() is the preferred call inside coroutines and raises RuntimeError if no loop is running.

build_router() creates a new APIRouter instance each call. The router must not be a module-level singleton; doing so causes route handlers to share state across test cases.

ProxyStats is per-app. Created in create_app() and passed into build_router(). Never use module-level counters or timestamps in monitor.py.
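
The two invariants above (a fresh router per call, stats owned by the app) can be sketched without FastAPI. Everything here is an illustrative assumption: RouterSketch is a plain-Python stand-in for APIRouter, ProxyStatsSketch has a single made-up counter, and the real build_router() and create_app() signatures may differ.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProxyStatsSketch:
    # Hypothetical counter; the real ProxyStats fields are not shown here.
    requests_total: int = 0


@dataclass
class RouterSketch:
    # Minimal stand-in for an APIRouter: path -> handler.
    routes: dict[str, Callable[[], dict[str, int]]] = field(default_factory=dict)


def build_router(stats: ProxyStatsSketch) -> RouterSketch:
    router = RouterSketch()  # new instance on every call, never module-level

    def monitor_data() -> dict[str, int]:
        # The handler closes over the stats object of one specific app.
        return {"requests_total": stats.requests_total}

    router.routes["/monitor/data"] = monitor_data
    return router


def create_app() -> tuple[ProxyStatsSketch, RouterSketch]:
    stats = ProxyStatsSketch()  # per-app state, created here
    return stats, build_router(stats)
```

Two apps built this way never see each other's counters, which is what keeps test cases isolated.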

scheduler.start() is def, not async def. It creates a task and returns immediately, so call it without await.
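
The shape of such a synchronous start() can be sketched as follows. SchedulerSketch, wake(), stop(), and the drains counter are hypothetical names for illustration; the real scheduler's internals are not shown in this document.

```python
import asyncio


class SchedulerSketch:
    def __init__(self) -> None:
        self._wakeup = asyncio.Event()
        self._task = None
        self.drains = 0

    def start(self) -> None:
        # Plain def: schedules the background loop and returns immediately.
        self._task = asyncio.get_running_loop().create_task(self._run())

    def wake(self) -> None:
        self._wakeup.set()

    async def stop(self) -> None:
        if self._task is None:
            return
        self._task.cancel()
        try:
            await self._task
        except asyncio.CancelledError:
            pass  # expected: the task was cancelled on purpose

    async def _run(self) -> None:
        while True:
            await self._wakeup.wait()
            self._wakeup.clear()
            self.drains += 1  # a real scheduler would drain the queue here


async def demo_scheduler() -> int:
    scheduler = SchedulerSketch()
    scheduler.start()  # no await
    scheduler.wake()
    await asyncio.sleep(0.01)  # give the background task a chance to run
    await scheduler.stop()
    return scheduler.drains
```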

asyncio.shield in _dispatch. The queue entry future is shielded so that a client timeout doesn't cancel the backend dispatch. On timeout, the entry is removed from the queue and the future is cancelled manually.

Testing

  • Tests are in tests/ using unittest (not pytest-native style), with IsolatedAsyncioTestCase for async tests.
  • [tool.pytest.ini_options] in pyproject.toml sets asyncio_mode = "auto" so async test methods run without additional decorators.
  • Integration tests (test_integration.py) use with TestClient(app) as client: — the context manager form is required to run the FastAPI lifespan (which starts the scheduler and opens the aiohttp session).
  • Test helpers: _entry(**kwargs) creates a QueueEntry with a future attached via asyncio.get_running_loop().create_future().
  • All 113 tests must pass: python -m pytest should exit 0.

Style

  • Python 3.13+; use from __future__ import annotations in all modules.
  • No comments unless the WHY is non-obvious (hidden constraint, asyncio race, workaround).
  • No docstrings on internal helpers.
  • Type annotations on all public functions and dataclass fields.
  • asyncio.get_running_loop() everywhere; never asyncio.get_event_loop().