Circuit breaker for upstream LLM nodes
Fault isolation across the foreground and pool nodes managed by LLMClient.
States
| State | Behaviour | Transition out |
|---|---|---|
| CLOSED | Healthy. Requests dispatched normally. | 3 consecutive failures → OPEN. |
| OPEN | Tripped. Node skipped on round-robin selection. | After 60 s cooldown → HALF. |
| HALF | Probing. Single request allowed. | Success → CLOSED. Failure → OPEN. |
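The transitions in the table can be sketched as a small state machine. This is a minimal illustration, not the actual NodeCircuitBreaker: the class name `SketchBreaker`, its method names, the injectable `clock` parameter, and the choice to move from OPEN to HALF lazily when the state is read are all assumptions. Only the two constructor defaults come from this document.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF = "half"


class SketchBreaker:
    """Hypothetical stand-in for NodeCircuitBreaker, for illustration only."""

    def __init__(self, failure_threshold=3, cooldown_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the cooldown can be tested without sleeping
        self._state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probe_in_flight = False

    @property
    def state(self):
        # Assumption: OPEN becomes HALF lazily, the first time the state is
        # read after the cooldown has elapsed.
        if self._state is State.OPEN and self._clock() - self._opened_at >= self.cooldown_seconds:
            self._state = State.HALF
            self._probe_in_flight = False
        return self._state

    def allow_request(self):
        state = self.state
        if state is State.CLOSED:
            return True
        if state is State.HALF and not self._probe_in_flight:
            self._probe_in_flight = True  # only a single probe is let through
            return True
        return False

    def record_success(self):
        # Any success closes the breaker and resets the consecutive-failure count.
        self._state = State.CLOSED
        self._failures = 0
        self._probe_in_flight = False

    def record_failure(self):
        state = self.state
        if state is State.OPEN:
            return  # already tripped; late failures do not extend the cooldown
        if state is State.HALF:
            self._trip()  # a failed probe reopens immediately
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self._state = State.OPEN
        self._opened_at = self._clock()
        self._failures = 0
        self._probe_in_flight = False
```

Passing a fake clock (for example `clock=lambda: now[0]`) lets a test step through the 60 s cooldown deterministically.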
Pool selection
Nodes within each pool are selected round-robin. get_worker_node(), get_vision_node(), get_coding_node(), and get_image_gen_node() all skip nodes whose breaker is OPEN. If every node in a pool is OPEN, the call falls back to the foreground client, which is subject to the same circuit-breaker treatment.
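The selection rule can be sketched as follows. This is an illustration under assumptions, not the LLMClient code: `SketchPool`, `pick`, and the `is_open` callback are hypothetical names, and nodes are represented as plain strings.

```python
class SketchPool:
    """Hypothetical round-robin pool that skips nodes whose breaker is OPEN."""

    def __init__(self, nodes, foreground):
        self._nodes = list(nodes)
        self._foreground = foreground  # used only when no pool node is usable
        self._cursor = 0               # index of the next node to try

    def pick(self, is_open):
        """Return the next usable node, or the foreground client.

        `is_open(node)` is an assumed callback that reports whether the
        node's breaker is currently OPEN.
        """
        n = len(self._nodes)
        for offset in range(n):
            idx = (self._cursor + offset) % n
            node = self._nodes[idx]
            if not is_open(node):
                # Advance past the chosen node so the next call rotates onward.
                self._cursor = (idx + 1) % n
                return node
        # Every node is OPEN (or the pool is empty): fall back.
        return self._foreground
```

For example, with nodes `["a", "b", "c"]` and only `b` OPEN, successive calls return `a`, `c`, `a`, and so on. Once all three are OPEN, `pick` returns the foreground client.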
Configuration
NodeCircuitBreaker(failure_threshold=3, cooldown_seconds=60.0)
State is process-local; restart resets every breaker to CLOSED. There is no persistence file because pool nodes themselves typically come and go between restarts.
Failure source
The breaker records transport failures (httpx exceptions, 5xx responses) and per-call timeouts (default 1200 s on chat completion calls). Application errors (401, 403) also count against the breaker, because from this process's perspective the node is effectively unreachable.
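A minimal sketch of that classification is shown below. The function name `counts_as_failure` and its signature are hypothetical. The treatment of other 4xx codes (such as 400 or 404) is not stated in this document, so the sketch assumes they do not count.

```python
def counts_as_failure(status_code=None, exc=None):
    """Decide whether one call outcome should be recorded as a breaker failure.

    `exc` stands for any exception raised by the call (transport error or
    timeout); `status_code` is the HTTP status when a response was received.
    """
    if exc is not None:
        return True  # transport failures and timeouts always count
    if status_code is None:
        return False  # nothing to classify
    if 500 <= status_code <= 599:
        return True  # server-side errors
    # 401 and 403 count per the text above; other codes are assumed not to.
    return status_code in (401, 403)
```

Keeping the decision in one function means every caller that reports outcomes to a breaker applies the same rule.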
Where it shows up in tools
delegate_to_swarm records success or failure directly on the corresponding node, so background workers also feed the breaker.