core / hypothesis.py — HypothesisTester

Parallel-debugging engine: generate competing hypotheses, test them concurrently, ask the LLM to rank survivors.

Why

Serial debugging burns 6-8 turns hopping between guesses. Parallelising the test phase collapses the same exploration into 2-3 turns: the worker pool runs every candidate test concurrently, heuristic error markers eliminate the obvious losers, and a single LLM call ranks what is left.

Data structures

Type	Fields
`Hypothesis` (line 21)	description · test_action · test_tool · result · consistent · confidence
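
As a rough illustration of the record listed above, the six fields could be modelled as a dataclass. The field names come from the table; the types and default values are assumptions made for this sketch, not taken from the source file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hypothesis:
    """One candidate explanation plus the test that would check it."""

    description: str                   # what might be wrong
    test_action: str                   # what to run (assumed to be a string)
    test_tool: str                     # which tool runs test_action (assumed to be a name)
    result: Optional[str] = None       # raw test output, filled in after the test runs
    consistent: Optional[bool] = None  # assumed to stay None until the test has run
    confidence: float = 0.0            # assumed to be a numeric score
```

With these assumed defaults, a freshly generated hypothesis carries only its description and recommended test; `result`, `consistent`, and `confidence` are filled in by the later stages.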

Methods

Method	Purpose
`__init__(llm_client, max_candidates=3, max_depth=2)`	Bind LLM client and limits.
`async generate_hypotheses(problem, context, error_output) -> List[Hypothesis]`	LLM produces 3-5 candidates with description and recommended test (line 88).
`async test_hypotheses_parallel(hypotheses, executor) -> List[Hypothesis]`	Run tests in parallel, set `consistent=True` when no error markers detected (line 128).
`async evaluate_results(problem, hypotheses) -> dict`	LLM ranks survivors; returns most likely root cause + remediation (line 167).
`async backtrack() -> Optional[Hypothesis]`	Pop next-best alternative when current pick fails (line 162).
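
The backtracking behaviour in the last row can be pictured with a small sketch. It assumes the ranked survivors are kept in a list ordered by `confidence`; the `BacktrackingPicker` class and its `current` attribute are hypothetical names used only for illustration, and the real method may track its state differently.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypothesis:
    # Trimmed to the two fields this sketch needs.
    description: str
    confidence: float = 0.0


class BacktrackingPicker:
    """Hypothetical helper: holds ranked survivors and hands out the next-best one."""

    def __init__(self, survivors: List[Hypothesis]) -> None:
        # Highest confidence first (assumed ordering).
        ranked = sorted(survivors, key=lambda h: h.confidence, reverse=True)
        self.current: Optional[Hypothesis] = ranked[0] if ranked else None
        self._alternatives: List[Hypothesis] = ranked[1:]

    async def backtrack(self) -> Optional[Hypothesis]:
        """Discard the current pick and return the next-best one, or None if none remain."""
        self.current = self._alternatives.pop(0) if self._alternatives else None
        return self.current
```

Returning `None` once the alternatives run out matches the `Optional[Hypothesis]` return type in the table and gives the caller a clear signal to stop or generate fresh hypotheses.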

Heuristic consistency check

If the test result contains none of the markers `error` · `exception` · `traceback` · `not found`, the hypothesis is provisionally marked consistent and forwarded to LLM evaluation. Hypotheses whose output does contain a marker are dropped first, so the ranking call is not spent on candidates that are clearly wrong.
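
The marker check amounts to a substring test. The following is a minimal sketch of that idea; the `looks_consistent` name and the case-insensitive comparison are assumptions, since the source lists only the four markers.

```python
ERROR_MARKERS = ("error", "exception", "traceback", "not found")


def looks_consistent(result: str) -> bool:
    """Return True when none of the error markers appear in the test output."""
    lowered = result.lower()  # case-insensitive matching is an assumption
    return not any(marker in lowered for marker in ERROR_MARKERS)
```

A plain substring test like this is deliberately coarse: under this sketch, output such as `0 errors` would still be flagged, which is why the result is only provisional and the survivors still go through LLM ranking.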

Concurrency

The class is fully async: `test_hypotheses_parallel` runs the tests concurrently with `asyncio.gather`, and its `executor` argument lets callers inject a sandbox-aware test runner.
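
The concurrency pattern can be sketched as follows. The sketch assumes the executor is an async callable that takes a tool name and an action and returns the output as a string; that signature, the `run_tests_parallel` name (a stand-in for `test_hypotheses_parallel`), the exception handling, and `fake_executor` are all illustrative assumptions rather than the module's actual API.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Optional

ERROR_MARKERS = ("error", "exception", "traceback", "not found")

# Assumed executor shape: (test_tool, test_action) -> raw output.
Executor = Callable[[str, str], Awaitable[str]]


@dataclass
class Hypothesis:
    description: str
    test_action: str
    test_tool: str
    result: Optional[str] = None
    consistent: Optional[bool] = None


async def run_tests_parallel(
    hypotheses: List[Hypothesis], executor: Executor
) -> List[Hypothesis]:
    """Run every hypothesis test concurrently and apply the marker heuristic."""

    async def run_one(h: Hypothesis) -> Hypothesis:
        try:
            h.result = await executor(h.test_tool, h.test_action)
        except Exception as exc:
            # Assumption: a crashing test is recorded as output, not re-raised.
            h.result = f"exception: {exc}"
        lowered = h.result.lower()
        h.consistent = not any(marker in lowered for marker in ERROR_MARKERS)
        return h

    # gather() schedules all coroutines at once and returns results in input order.
    return list(await asyncio.gather(*(run_one(h) for h in hypotheses)))


async def fake_executor(tool: str, action: str) -> str:
    """Hypothetical stand-in for a sandbox-aware test runner."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return "file not found" if "missing" in action else "ok"


def demo() -> List[Hypothesis]:
    candidates = [
        Hypothesis("config file absent", "cat missing.toml", "shell"),
        Hypothesis("service unreachable", "ping svc", "shell"),
    ]
    return asyncio.run(run_tests_parallel(candidates, fake_executor))
```

Because the runner is passed in rather than hard-coded, the same function can be exercised with a fake in tests and with a sandboxed runner in production.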