Circuit breaker for upstream LLM nodes

Fault isolation across the foreground and pool nodes managed by LLMClient.

States

State  | Behaviour                                       | Transition out
-------|-------------------------------------------------|----------------------------------
CLOSED | Healthy. Requests dispatched normally.          | 3 consecutive failures → OPEN.
OPEN   | Tripped. Node skipped on round-robin selection. | After 60 s cooldown → HALF.
HALF   | Probing. Single request allowed.                | Success → CLOSED. Failure → OPEN.
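
The transitions in the table can be sketched as a small state machine. This is a minimal illustration, not the project's actual class body: the method names (allow_request, record_success, record_failure), the injectable clock parameter, and the lazy OPEN → HALF check are assumptions made for the example, and the sketch is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF = "HALF"


class NodeCircuitBreaker:
    """Illustrative sketch only; method names and clock injection are assumptions."""

    def __init__(self, failure_threshold=3, cooldown_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the cooldown can be exercised in tests
        self._state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probe_in_flight = False

    @property
    def state(self):
        # OPEN lazily becomes HALF once the cooldown has elapsed.
        if (self._state is State.OPEN
                and self._clock() - self._opened_at >= self.cooldown_seconds):
            self._state = State.HALF
        return self._state

    def allow_request(self):
        state = self.state
        if state is State.CLOSED:
            return True
        if state is State.HALF and not self._probe_in_flight:
            self._probe_in_flight = True  # only one probe request at a time
            return True
        return False  # OPEN, or a probe is already running

    def record_success(self):
        self._probe_in_flight = False
        self._failures = 0
        self._state = State.CLOSED

    def record_failure(self):
        self._probe_in_flight = False
        if self.state is State.HALF:
            self._trip()  # a failed probe reopens the breaker immediately
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self._state = State.OPEN
        self._failures = 0
        self._opened_at = self._clock()
```

Because a success resets the failure counter, only consecutive failures reach the threshold, which matches the CLOSED row of the table.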

Pool selection

Selection is round-robin across each pool's nodes. get_worker_node(), get_vision_node(), get_coding_node(), and get_image_gen_node() all skip nodes whose breaker is OPEN. If every node in a pool is OPEN, the call falls back to the foreground client, which is subject to the same circuit-breaker treatment.
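
A rough sketch of the skip-and-fall-back selection described above. The Node and Pool classes and the breaker_open flag are hypothetical stand-ins for illustration; the real get_*_node() functions may be structured differently, and the foreground client's own breaker is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    breaker_open: bool = False  # stand-in for "this node's breaker is OPEN"


class Pool:
    """Hypothetical pool wrapper; not the project's actual type."""

    def __init__(self, nodes, foreground):
        self.nodes = list(nodes)
        self.foreground = foreground
        self._cursor = 0  # index of the next round-robin candidate

    def next_node(self):
        # Try each node at most once, starting from the round-robin cursor.
        for offset in range(len(self.nodes)):
            index = (self._cursor + offset) % len(self.nodes)
            node = self.nodes[index]
            if not node.breaker_open:
                self._cursor = (index + 1) % len(self.nodes)
                return node
        # Every pool node is OPEN (or the pool is empty): use the foreground client.
        return self.foreground
```

Advancing the cursor past the node that was returned keeps the rotation fair even when some nodes are skipped.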

Configuration

NodeCircuitBreaker(failure_threshold=3, cooldown_seconds=60.0)

State is process-local; restart resets every breaker to CLOSED. There is no persistence file because pool nodes themselves typically come and go between restarts.

Failure source

The breaker records both transport failures (httpx exceptions, 5xx responses) and per-call timeouts (default 1200 s on chat completion calls). Application errors (401, 403) count as failures too, because the node is effectively unreachable from this process's perspective.
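
The classification above could be expressed as a single predicate. This is a sketch under assumptions: the function name counts_as_failure is hypothetical, any raised exception stands in for an httpx exception or a timeout, and treating other 4xx codes as non-failures is a guess, since only 401 and 403 are stated.

```python
def counts_as_failure(status_code=None, exc=None):
    """Return True if a call outcome should be recorded as a breaker failure.

    Hypothetical helper: `exc` stands in for a transport error or timeout,
    `status_code` for the HTTP status when a response was received.
    """
    if exc is not None:
        return True  # transport failures and per-call timeouts
    if status_code is None:
        return False  # nothing to classify
    if 500 <= status_code <= 599:
        return True  # upstream server errors
    if status_code in (401, 403):
        return True  # node is unusable from this process's perspective
    return False  # assumption: other statuses do not count against the node
```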

Where it shows up in tools

delegate_to_swarm records success/failure on the corresponding node directly so background workers also feed the breaker.
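
One way a tool could report outcomes to a node's breaker is a thin wrapper around the call. The names run_with_breaker and CountingBreaker are hypothetical, and the breaker interface (record_success, record_failure) is assumed for the example; this is not the actual delegate_to_swarm code.

```python
class CountingBreaker:
    """Hypothetical stand-in that only counts reported outcomes."""

    def __init__(self):
        self.successes = 0
        self.failures = 0

    def record_success(self):
        self.successes += 1

    def record_failure(self):
        self.failures += 1


def run_with_breaker(breaker, task):
    """Run task() and report its outcome to the node's breaker."""
    try:
        result = task()
    except Exception:
        breaker.record_failure()
        raise  # the caller still sees the original error
    breaker.record_success()
    return result
```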