core / verifier.py — Verifier

Reflective self-evaluation: a separate LLM call (worker pool when possible) confirms / refutes the agent's conclusion before it reaches the user.

API

TypeFields / signature
VerifyVerdict (line 24)CONFIRMED · REFUTED · UNCERTAIN
VerifyResult (line 31)verdict · confidence (0–1) · reasoning · issues
async verify_claim(claim, evidence, context)Generic textual verification (line 237).
async verify_code_output(code, output, intent)Compare exec output to declared intent (line 248).
async adversarial_probe(problem, solution) -> List[dict]Generate 3-5 edge cases the solution might fail on (line 259).

Routing

Tries llm_client.route(task="VERIFY", max_tokens=2048, ...) against the worker pool first — both cheaper and a different "perspective" than the foreground model. Falls back to a direct foreground call if the worker pool is empty or its circuit is open.

JSON parse robustness

The verifier strips any <think>…</think> blocks the worker model emits, then attempts: direct parse → reversed greedy match → final greedy match. Returns None if no parse succeeds — callers treat None as "skip verification" rather than "verdict unknown".

Concurrency

Async; safe to call concurrently per request because the underlying httpx client per node already pools connections.

Executed-code reconstruction from tool_call_id

When the verifier gate in handle_chat classifies a turn as code-shape (last substantive tool was execute / postgres_*), it calls verify_code_output(code, output, intent). The code slot MUST be the actual program that was submitted.

Prior to the fix, the call site at agent.py:4835 passed code=tool_name — literally the string "execute". With nothing real to audit, the verifier LLM hallucinated plausible-sounding rejections ("appears to be a directory listing", "missing expected field") and emitted REFUTED (95%) verdicts on CORRECT turns. The user then saw Verifier note: … contradicting their right answer.

The fix is _reconstruct_executed_code(messages, tool_msg) in agent.py: walk messages backwards from the tool result, match tool_call_id to the assistant's tool_calls[i].function.arguments, and extract content / code / script_content / text / command / cmd (first non-empty wins). Capped at 4000 chars. When the lookup fails — malformed arguments, missing id, non-dict messages — the gate falls back to verify_claim, which doesn't need a code slot.

Covered by tests/test_verifier_code_reconstruction.py (14 cases including dict-vs-string arguments, alias preference, traversal past unrelated messages, oversize cap, identity-based dedup on duplicate tool-call ids).