core / verifier.py — Verifier
Reflective self-evaluation: a separate LLM call (routed to the worker pool when possible) confirms or refutes the agent's conclusion before it reaches the user.
API
| Type | Fields / signature |
|---|---|
| VerifyVerdict (line 24) | CONFIRMED · REFUTED · UNCERTAIN |
| VerifyResult (line 31) | verdict · confidence (0–1) · reasoning · issues |
| async verify_claim(claim, evidence, context) | Generic textual verification (line 237). |
| async verify_code_output(code, output, intent) | Compare exec output to declared intent (line 248). |
| async adversarial_probe(problem, solution) -> List[dict] | Generate 3-5 edge cases the solution might fail on (line 259). |
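For orientation, the two result types in the table could look roughly like the sketch below. Only the names and fields come from the table; the string-valued enum, the list-of-strings type for issues, the defaults, and the clamping of confidence are assumptions made for illustration, not a copy of core/verifier.py.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class VerifyVerdict(str, Enum):
    # Member names come from the table; string values are an assumption.
    CONFIRMED = "CONFIRMED"
    REFUTED = "REFUTED"
    UNCERTAIN = "UNCERTAIN"


@dataclass
class VerifyResult:
    verdict: VerifyVerdict
    confidence: float  # documented range is 0-1
    reasoning: str = ""  # default is an assumption
    issues: List[str] = field(default_factory=list)  # element type is an assumption

    def __post_init__(self) -> None:
        # Assumption: out-of-range confidences are clamped rather than rejected.
        self.confidence = max(0.0, min(1.0, float(self.confidence)))
```

A caller would then branch on result.verdict and use result.issues to build any note shown to the user.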
Routing
The verifier first tries llm_client.route(task="VERIFY", max_tokens=2048, ...) against the worker pool, which is cheaper than the foreground model and brings a different "perspective". It falls back to a direct foreground call if the worker pool is empty or its circuit is open.
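The worker-first, foreground-fallback order can be sketched as below. This is a minimal illustration, not the real routing code: PoolUnavailable and route_with_fallback are hypothetical names, and the two callables stand in for the worker-pool call and the direct foreground call.

```python
import asyncio
from typing import Awaitable, Callable


class PoolUnavailable(Exception):
    """Hypothetical signal: the worker pool is empty or its circuit is open."""


async def route_with_fallback(
    worker_call: Callable[[], Awaitable[str]],
    foreground_call: Callable[[], Awaitable[str]],
) -> str:
    # Prefer the worker pool; use the foreground model only when the pool
    # reports that it cannot serve the request.
    try:
        return await worker_call()
    except PoolUnavailable:
        return await foreground_call()
```

Catching only the pool-unavailable signal keeps genuine bugs in the worker path visible instead of silently rerouting them to the foreground model.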
JSON parse robustness
The verifier strips any <think>…</think> blocks the worker model emits, then attempts, in order: direct parse → reversed greedy match → final greedy match. It returns None if no parse succeeds; callers treat None as "skip verification" rather than "verdict unknown".
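One plausible reading of that parse chain is sketched below. The exact matching rules are not spelled out here, so the interpretation is an assumption: the "reversed" step takes the span from the last opening brace to the last closing brace, and the "final greedy" step takes the span from the first opening brace to the last closing brace. The helper names are hypothetical.

```python
import json
import re
from typing import Optional

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def _try_load(text: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if text is not a JSON object."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def parse_verdict_json(raw: str) -> Optional[dict]:
    text = _THINK_RE.sub("", raw).strip()

    # Step 1: the whole reply is already a JSON object.
    parsed = _try_load(text)
    if parsed is not None:
        return parsed

    end = text.rfind("}")
    if end == -1:
        return None

    # Step 2 (assumed "reversed" match): last '{' up to the last '}'.
    # Step 3 (assumed "final greedy" match): first '{' up to the last '}'.
    for start in (text.rfind("{", 0, end), text.find("{")):
        if start != -1 and start < end:
            parsed = _try_load(text[start:end + 1])
            if parsed is not None:
                return parsed
    return None
```

In this sketch, step 2 recovers a flat object that follows prose containing stray braces, while step 3 recovers a nested object surrounded by prose.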
Concurrency
All entry points are async and safe to call concurrently within a request, because the per-node httpx client already pools connections.
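As an illustration of concurrent use, several verifications can be awaited together with asyncio.gather. StubVerifier below is a hypothetical stand-in so the example runs without a model backend; only the verify_claim(claim, evidence, context) signature is taken from the API table, and the string return value is a simplification.

```python
import asyncio
from typing import List


class StubVerifier:
    """Hypothetical stand-in; the real Verifier would issue an LLM call here."""

    async def verify_claim(self, claim: str, evidence: str, context: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, as a network await would
        return "CONFIRMED" if claim in evidence else "UNCERTAIN"


async def verify_all(verifier: StubVerifier, claims: List[str], evidence: str) -> List[str]:
    # Start every verification at once; gather returns results in input order.
    pending = [verifier.verify_claim(c, evidence, context="") for c in claims]
    return list(await asyncio.gather(*pending))
```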
Executed-code reconstruction from tool_call_id
When the verifier gate in handle_chat classifies a turn as code-shape (last substantive tool was execute / postgres_*), it calls verify_code_output(code, output, intent). The code slot MUST be the actual program that was submitted.
Prior to the fix, the call site at agent.py:4835 passed code=tool_name, which is literally the string "execute". With nothing real to audit, the verifier LLM hallucinated plausible-sounding rejections ("appears to be a directory listing", "missing expected field") and emitted REFUTED (95%) verdicts on correct turns. The user then saw a "Verifier note: …" that contradicted a correct answer.
The fix is _reconstruct_executed_code(messages, tool_msg) in agent.py: walk messages backwards from the tool result, match tool_call_id to the assistant's tool_calls[i].function.arguments, and extract content / code / script_content / text / command / cmd (first non-empty wins). Capped at 4000 chars. When the lookup fails — malformed arguments, missing id, non-dict messages — the gate falls back to verify_claim, which doesn't need a code slot.
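The lookup described above can be sketched as follows. This is not the code from agent.py: the message shape (OpenAI-style dicts with role, tool_calls, and tool_call_id), the identity-based search for the tool result, and the None return on failure are assumptions chosen to match the description and the listed test cases.

```python
import json
from typing import Any, List, Optional

# Alias order from the description: first non-empty value wins.
_CODE_KEYS = ("content", "code", "script_content", "text", "command", "cmd")
_MAX_CODE_CHARS = 4000


def reconstruct_executed_code(messages: List[Any], tool_msg: Any) -> Optional[str]:
    """Return the program behind tool_msg, or None if it cannot be recovered."""
    if not isinstance(tool_msg, dict) or not tool_msg.get("tool_call_id"):
        return None
    call_id = tool_msg["tool_call_id"]

    # Assumption: find the tool result by identity, so that duplicate
    # tool-call ids elsewhere in the history do not confuse the search.
    idx = next((i for i, m in enumerate(messages) if m is tool_msg), None)
    if idx is None:
        return None

    # Walk backwards from the tool result, skipping unrelated messages.
    for msg in reversed(messages[:idx]):
        if not isinstance(msg, dict) or not isinstance(msg.get("tool_calls"), list):
            continue
        for call in msg["tool_calls"]:
            if not isinstance(call, dict) or call.get("id") != call_id:
                continue
            function = call.get("function")
            args = function.get("arguments") if isinstance(function, dict) else None
            if isinstance(args, str):  # arguments may be a JSON string or a dict
                try:
                    args = json.loads(args)
                except json.JSONDecodeError:
                    return None
            if not isinstance(args, dict):
                return None
            for key in _CODE_KEYS:
                value = args.get(key)
                if isinstance(value, str) and value.strip():
                    return value[:_MAX_CODE_CHARS]
            return None
    return None
```

A gate built on this sketch would call verify_code_output only when the function returns a string, and fall back to verify_claim when it returns None.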
Covered by tests/test_verifier_code_reconstruction.py (14 cases including dict-vs-string arguments, alias preference, traversal past unrelated messages, oversize cap, identity-based dedup on duplicate tool-call ids).