core / dream.py

Idle-time consolidation, synthetic challenge generation, and the full self-play orchestration pipeline.

Responsibilities

Mine failed tasks from the journal into standalone, self-contained challenges via journal_challenges.
Validate LLM-generated challenges before running them — quality gate, preflight, and validator self-test gate.
Detect cross-episode tool patterns that should become composed skills.
Orchestrate the self-play simulation loop: spawn an isolated temp agent, run the challenge, score with correctness_weighted_score, verify and persist any new lesson via SkillMemory.

Validation stack (three gates, in order)

1. `validate_challenge_quality(setup, validator) → (bool, reason)`

Pattern-based quality gate (line 102). Rejects:

Validators that call random.seed, random.randint, random.uniform, random.choice or np.random — data-generation markers that would make scoring non-deterministic.
Validators that depend on dynamic discovery (os.listdir, glob.glob, pathlib.Path) when the setup writes files with explicit names — they must reference at least one shared filename.
SQL setup scripts where CREATE TABLE column count doesn't match INSERT value count, or CSV headers that don't match row field count.
The unwinnable split pattern: .strip().split('\n') combined with randomness and a len(act) != len(exp) check.

Rejection kinds are returned so the regeneration loop can route targeted feedback (files-mismatch → targeted "open file X" hint; data-gen → "don't call random.seed" hint; etc.). Generation is attempted up to 3 times. A files-mismatch rejection triggers the validator repair path, which regenerates only the validator with a focused prompt.
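A minimal sketch of two of the rules above (randomness markers and the files-mismatch check), assuming plain text matching is enough. The names `quality_gate_sketch`, `HINTS`, and the filename-extraction regex are hypothetical and are not taken from dream.py; the real gate covers more rules.

```python
import re
from typing import Optional, Tuple

# Calls named in the rejection rules; treated here as plain text markers.
_RANDOM_MARKERS = re.compile(
    r"\b(?:random\.(?:seed|randint|uniform|choice)|np\.random)\b"
)
_DISCOVERY_MARKERS = ("os.listdir", "glob.glob", "pathlib.Path")
# Hypothetical heuristic: filenames the setup opens for writing or appending.
_WRITTEN_FILE = re.compile(r"""open\(\s*['"]([^'"]+)['"]\s*,\s*['"][wa]""")

# Hypothetical hint table keyed by rejection kind.
HINTS = {
    "data-gen": "Do not call random.seed or other random generators in the validator.",
    "files-mismatch": "Open one of the files written by the setup script by name.",
}


def quality_gate_sketch(setup: str, validator: str) -> Tuple[bool, Optional[str]]:
    """Return (ok, rejection_kind) for two of the documented rules."""
    if _RANDOM_MARKERS.search(validator):
        return False, "data-gen"
    written = set(_WRITTEN_FILE.findall(setup))
    discovers = any(marker in validator for marker in _DISCOVERY_MARKERS)
    # Dynamic discovery is only a problem when no written file is named.
    if written and discovers and not any(name in validator for name in written):
        return False, "files-mismatch"
    return True, None
```

Returning the kind rather than a free-form reason lets a caller look up the matching hint, for example `HINTS[kind]`, before the next generation attempt.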

2. Preflight (`.preflight.py`)

Runs the validator in the sandbox with __name__ == "__dry_run__" (bypasses any if __name__ == "__main__" guard). Catches module-scope NameError / ImportError / ModuleNotFoundError — unrunnable validators that used to burn a whole solver attempt.
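The dry-run idea can be sketched with the standard library's runpy, which lets the caller choose the value of __name__. This is an in-process illustration only: the real preflight runs inside the sandbox, and `preflight_sketch` is a hypothetical name. With __name__ set to "__dry_run__", a guard of the form if __name__ == "__main__" evaluates to False, so only module-scope code executes.

```python
import os
import runpy
import tempfile
from typing import Optional


def preflight_sketch(validator_path: str) -> Optional[str]:
    """Execute module-scope code only; return an error summary or None."""
    try:
        # run_name replaces __name__, so `if __name__ == "__main__":` is False.
        runpy.run_path(validator_path, run_name="__dry_run__")
    except (NameError, ImportError) as exc:  # ModuleNotFoundError is an ImportError
        return f"{type(exc).__name__}: {exc}"
    return None


with tempfile.TemporaryDirectory() as tmp:
    broken = os.path.join(tmp, "broken_validator.py")
    guarded = os.path.join(tmp, "guarded_validator.py")
    with open(broken, "w") as fh:
        fh.write("import module_that_does_not_exist_xyz\n")
    with open(guarded, "w") as fh:
        fh.write('if __name__ == "__main__":\n    undefined_name\n')
    broken_result = preflight_sketch(broken)    # module-scope import fails
    guarded_result = preflight_sketch(guarded)  # guarded body never runs
```

Here `broken_result` carries a ModuleNotFoundError summary, while `guarded_result` is None because the faulty line sits behind the guard.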

3. Validator self-test gate (`.validator_selftest.py`)

Catches internal-contradiction bugs like the canonical float('60.00%') trap, where a validator formats an expected field with a unit suffix and then calls float() on it. Flow:

_instrument_validator_for_self_test(src): uses the AST to find the first subprocess.run(...solution.py...) statement and prepends a probe that dumps the first-resolved expected_* variable between <<<__GHOST_SELFTEST_EXPECTED_START__>>> / <<<__GHOST_SELFTEST_EXPECTED_END__>>> sentinels, then raises SystemExit(42).
_extract_selftest_dump(stdout): pulls the dumped block out of the probe's stdout.
Write a solution.py that calls sys.stdout.write(<dumped>), then restore mocks from the post-setup snapshot.
Run the original validator. Because this solution prints exactly what the validator expects, any crash here comes from the validator itself. _looks_like_validator_crash(out) returns True iff the traceback's innermost frame is .validator.py, in which case the challenge is rejected as unwinnable.

Candidate variable names scanned (in order): expected_output, expected_lines, expected, expected_text, expected_str, expected_result, golden_output, golden, correct_output, answer. The gate is best-effort: unparseable validators or those without a matching subprocess.run skip cleanly.
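A small sketch of the sentinel extraction step and of the float('60.00%') trap it is meant to expose. The sentinel strings come from the description above; the function bodies are assumptions (for instance, whether the real extractor trims whitespace is not stated), and `validator_contradicts_itself` is a hypothetical model of the bug, not a function in dream.py.

```python
from typing import Optional

START = "<<<__GHOST_SELFTEST_EXPECTED_START__>>>"
END = "<<<__GHOST_SELFTEST_EXPECTED_END__>>>"


def extract_selftest_dump_sketch(stdout: str) -> Optional[str]:
    """Return the text between the two sentinels, or None if either is missing."""
    start = stdout.find(START)
    if start == -1:
        return None
    start += len(START)
    end = stdout.find(END, start)
    if end == -1:
        return None
    return stdout[start:end]


def validator_contradicts_itself(expected: str) -> bool:
    """Model of the trap: the validator calls float() on its own expected value."""
    try:
        float(expected)
    except ValueError:
        return True  # even a perfect solution would crash this validator
    return False
```

For an expected value of "60.00%", the second function reports a contradiction, which is the situation the self-test gate rejects before any solver attempt is spent.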

Runtime crash detector (widened)

Even past the three gates, a validator can raise during comparison. The attempt-loop circuit breaker classifies a traceback as a validator crash when the tail frame is .validator.py, solution.py is absent from the feedback, and the exception type is one of:

Structural: SyntaxError, IndentationError, ImportError, ModuleNotFoundError, NameError
Internal-contradiction: ValueError, TypeError, KeyError, IndexError, AttributeError

Detection aborts the cycle after attempt 1 instead of burning all 3 on the same broken validator.
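The three conditions above can be sketched as a single predicate over the feedback text. This assumes the feedback contains a standard CPython traceback; `is_validator_crash_sketch` and the parsing approach are illustrative, not the circuit breaker's actual code.

```python
import re

VALIDATOR_CRASH_TYPES = {
    # Structural
    "SyntaxError", "IndentationError", "ImportError", "ModuleNotFoundError", "NameError",
    # Internal-contradiction
    "ValueError", "TypeError", "KeyError", "IndexError", "AttributeError",
}

# Matches the `File "...", line N` lines of a standard traceback.
_FRAME = re.compile(r'^\s*File "([^"]+)", line \d+', re.MULTILINE)


def is_validator_crash_sketch(feedback: str) -> bool:
    """True when the tail frame is .validator.py and the exception type is listed."""
    if "solution.py" in feedback:
        return False  # the solver's code is involved, so this is not validator-only
    frames = _FRAME.findall(feedback)
    if not frames or not frames[-1].endswith(".validator.py"):
        return False
    last_line = feedback.strip().splitlines()[-1]
    exc_type = last_line.split(":", 1)[0].strip()
    return exc_type in VALIDATOR_CRASH_TYPES
```

A caller in the attempt loop could use a True result after attempt 1 to abort the cycle instead of retrying against the same validator.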

Lesson pipeline

`_extract_structured_lesson`

Meta-cognitive LLM call that returns {trigger, anti_pattern, correct_pattern, domains, confidence, task/mistake/solution (legacy mirrors)}. The prompt explicitly requires task-class triggers, forbids copying fixture literals, and mandates a non-empty taxonomy domain set.

`_generalization_guard`

Last line of defence against overfit lessons. Uses n-gram token overlap (_GENERALIZATION_MIN_NGRAM = 6) to reject lessons whose:

trigger copies a 6-token run from the challenge text
correct_pattern copies a 6-token run from setup_script or the validator
domains is empty or contains nothing from _VALID_LESSON_DOMAINS ({data_analysis, regex_parse, sql, concurrency, algo, bash, python_general})
trigger or correct_pattern is empty
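The four rejection rules above can be illustrated with a short n-gram overlap check. This is a sketch under stated assumptions: tokens are lowercase whitespace-separated words, and `generalization_guard_sketch` and its reason strings are hypothetical. The real tokenizer and messages in dream.py may differ.

```python
from typing import Set, Tuple

MIN_NGRAM = 6  # mirrors _GENERALIZATION_MIN_NGRAM
VALID_DOMAINS = {
    "data_analysis", "regex_parse", "sql", "concurrency",
    "algo", "bash", "python_general",
}


def _ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shares_run(a: str, b: str, n: int = MIN_NGRAM) -> bool:
    """True when a and b have at least one identical run of n tokens."""
    return bool(_ngrams(a, n) & _ngrams(b, n))


def generalization_guard_sketch(
    lesson: dict, challenge: str, setup_script: str, validator: str
) -> Tuple[bool, str]:
    trigger = (lesson.get("trigger") or "").strip()
    pattern = (lesson.get("correct_pattern") or "").strip()
    if not trigger or not pattern:
        return False, "empty trigger or correct_pattern"
    if not set(lesson.get("domains") or []) & VALID_DOMAINS:
        return False, "no valid domain"
    if shares_run(trigger, challenge):
        return False, "trigger copies challenge text"
    if shares_run(pattern, setup_script) or shares_run(pattern, validator):
        return False, "correct_pattern copies fixture text"
    return True, "ok"
```

A trigger that restates the task class in its own words passes, while one that lifts six consecutive words from the challenge is rejected.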

`_verify_lesson_helpful`

For struggled-then-won and failure cases, re-runs the solver once with the lesson prepended under the production ### SKILL PLAYBOOK: header. Keeps only if the outcome strictly improves.
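As a rough illustration of the "strictly improves" rule, the sketch below assumes outcomes can be compared as numeric scores and that the solver is available as a callable. Both are assumptions: `run_solver`, `baseline_score`, and the exact prompt layout are hypothetical. Only the ### SKILL PLAYBOOK: header is taken from the description above.

```python
from typing import Callable

PLAYBOOK_HEADER = "### SKILL PLAYBOOK:"


def build_prompt_with_lesson(task_prompt: str, lesson_text: str) -> str:
    """Prepend the lesson under the production playbook header."""
    return f"{PLAYBOOK_HEADER}\n{lesson_text}\n\n{task_prompt}"


def lesson_is_helpful_sketch(
    task_prompt: str,
    lesson_text: str,
    baseline_score: float,
    run_solver: Callable[[str], float],  # hypothetical: prompt in, score out
) -> bool:
    """Re-run once with the lesson; keep it only on a strict improvement."""
    new_score = run_solver(build_prompt_with_lesson(task_prompt, lesson_text))
    return new_score > baseline_score
```

Using a strict comparison means a lesson that merely reproduces the earlier outcome is dropped rather than persisted.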

Isolation markers

The isolated sub-context sets several attributes to prevent production-state writes:

ReadOnlySkillMemory.is_read_only = True is a class marker that agent.py checks (via is True) at two points: (a) to skip the entire ~15 s Perfect-It follow-up LLM call during self-play, and (b) to short-circuit the confirmation turn that the solver would otherwise spend re-confirming that solution.py just ran clean. When the solver's last execute tool call exits 0 on solution.py with non-empty stdout, the turn loop sets force_stop = True and synthesises a minimal final message, skipping the ~15–25 s "task complete" thinking turn. The outer validator re-runs solution.py directly and never reads the agent's reasoning, so the confirmation turn was pure dead time.
isolated_context.args.perfect_it = False, smart_memory = 0.0, native_tools = True.
selfplay_loop_task / selfplay_loop_stop / selfplay_loop_started_at are stripped — otherwise the inner sub-agent's handle_chat would trip the outer loop's user-message interrupt hook.
verifier, uncertainty_tracker, mcts_reasoner, hypothesis_tester, frontier_tracker set to None.
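The identity check on is_read_only can be illustrated as follows. One plausible motivation, stated here as an assumption, is that a truthiness test would also fire for mock objects, whose attributes are truthy by default. The class body and `in_selfplay_sketch` are stand-ins, not the code in agent.py.

```python
from unittest.mock import MagicMock


class ReadOnlySkillMemory:
    """Stand-in for the real class; only the marker attribute is modelled."""
    is_read_only = True


def in_selfplay_sketch(skill_memory: object) -> bool:
    # Identity comparison: only the literal True counts, not any truthy value.
    return getattr(skill_memory, "is_read_only", False) is True


mock_memory = MagicMock()
mock_attr_is_truthy = bool(mock_memory.is_read_only)  # auto-created attribute
```

With this check, a real ReadOnlySkillMemory instance enables the self-play shortcuts, while a bare MagicMock or an object without the attribute does not.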

Rejection prompt contract (retry attempts)

When an attempt fails validation, the retry prompt injected for the next attempt contains the validator's feedback string (the FAIL line with expected-vs-actual output) plus optional float-formatting hints. It does not contain the .validator.py source. An earlier revision pasted the full validator script into the retry prompt so the agent could "debug the validator's logic", but this turned every struggled-then-won cycle into an answer-key lookup — the agent copied the validator's constants (multipliers, SQL query shape) instead of reasoning from the expected-vs-actual diff. Skill-gate lessons from those cycles were memorised constants, not transferable knowledge.

Retry prompts now force the solver to reason from the diff and the original task description. Some complex challenges will fail their second attempt that previously "succeeded" via copying — that is the intended behaviour: a genuine failure is better training signal than a cheated pass.
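A minimal sketch of the contract, assuming the retry prompt is assembled from plain strings. The section headers and the function name are hypothetical. The point of the sketch is the signature: the validator source is not an input, so it cannot leak into the prompt.

```python
def build_retry_prompt_sketch(task: str, feedback: str, float_hint: str = "") -> str:
    """Assemble a retry prompt from the task, the FAIL feedback, and an optional hint."""
    parts = [
        "### TASK",            # hypothetical header
        task,
        "### VALIDATOR FEEDBACK",  # hypothetical header
        feedback,
    ]
    if float_hint:
        parts += ["### HINT", float_hint]
    parts.append("Fix solution.py by reasoning from the expected-vs-actual difference.")
    return "\n".join(parts)
```

For example, calling it with the feedback "FAIL: expected 12.50 got 12.5" and a hint about two-decimal formatting yields a prompt that contains the diff and the hint but nothing from .validator.py.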

Public functions

| Function | Purpose |
| --- | --- |
| `validate_challenge_quality(setup, validator) → (bool, reason)` | Pattern-based quality gate. |
| `_instrument_validator_for_self_test(src) → Optional[str]` | AST probe injector (module-level, testable). |
| `_extract_selftest_dump(stdout) → Optional[str]` | Pull dumped expected-output from sentinel markers. |
| `_looks_like_validator_crash(text) → bool` | Tail-of-traceback check for `.validator.py` frame. |
| `detect_tool_patterns(skill_memory) → list` | Cross-episode tool-call sequence detection. |
| `Dreamer.synthetic_self_play(model_name, is_background)` | Full pipeline: seed → source select → gates → run → score → extract → verify → persist. |
| `Dreamer._try_journal_challenge(probability)` | Probabilistic journal mining; probability bumped to 0.75 under saturation. |
| `Dreamer._generalization_guard(lesson, …) → (bool, reason)` | Overfit-lesson rejection. |
| `Dreamer.dream(model_name)` | Journal → long-term consolidation (the REM path). |

Concurrency

Orchestration is async; the temp agent loop runs synchronously inside the simulation. A cycle is triggered by the lifespan-spawned biological watchdog, by the self_play tool (one-shot), or by self_play_loop (continuous). The validator probe, the self-test, and the solver run all execute on the sandbox via asyncio.to_thread.
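The offloading pattern can be sketched with asyncio.to_thread (Python 3.9+), which runs a blocking callable in a worker thread so the event loop stays responsive. `blocking_sandbox_call` is a hypothetical stand-in for a real sandbox invocation, and the stage names are only labels.

```python
import asyncio
import time
from typing import List


def blocking_sandbox_call(stage: str) -> str:
    """Stand-in for a blocking sandbox invocation."""
    time.sleep(0.01)
    return f"{stage}: ok"


async def run_stages() -> List[str]:
    results = []
    # Stages run in order; each blocking call is moved off the event loop.
    for stage in ("validator probe", "self-test", "solver"):
        results.append(await asyncio.to_thread(blocking_sandbox_call, stage))
    return results


stage_results = asyncio.run(run_stages())
```

The stages still execute one after another, but other coroutines on the same loop can make progress while each call waits.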

Frontier-aware cluster selection

Before the source-selection stage, synthetic_self_play chooses which cluster to target. The legacy path calls FrontierTracker.pick_seed (brittle-pool weighted). When --frontier-selfplay is on (the default), ctx.prm_scorer is a real PRMScorer with has_model=True, and ctx.trajectory_collector is a real TrajectoryCollector (strict isinstance checks, so MagicMock-backed test contexts fail closed), the dream loop instead:

Builds the candidate cluster pool as set(challenge_templates.TEMPLATES.keys()) ∪ tracker.clusters.
Computes per-cluster signals via core/frontier_selection.py: compute_cluster_uncertainty (PRM boundary-distance) and compute_cluster_rarity (log-decay of Trajectory.cluster counts).
Calls frontier_tracker.pick_frontier_seed(uncertainty_by_cluster=…, rarity_by_cluster=…, uniform_sample_prob=args.frontier_uniform_sample_prob).

Any exception in the frontier-aware block is logged at debug and falls through to pick_seed — frontier weighting must never block a self-play cycle. The selected seed's hint is appended to the challenge-generation prompt under ### FRONTIER SEED regardless of which picker produced it. The new path's hint begins with FRONTIER TARGET (PRM-weighted) so logs and tests can attribute the source.
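The gating and fall-through behaviour described above can be sketched as follows. The two classes are empty stand-ins for the real PRMScorer and TrajectoryCollector, and the picker callables are placeholders. Only the control flow (strict isinstance checks, debug-level logging, fall back to pick_seed) reflects the description.

```python
import logging
from typing import Callable
from unittest.mock import MagicMock

logger = logging.getLogger("dream_sketch")


class PRMScorer:  # stand-in for the real class
    def __init__(self, has_model: bool) -> None:
        self.has_model = has_model


class TrajectoryCollector:  # stand-in for the real class
    pass


def frontier_path_enabled(flag_on: bool, prm_scorer: object, collector: object) -> bool:
    return bool(
        flag_on
        and isinstance(prm_scorer, PRMScorer)
        and prm_scorer.has_model is True
        and isinstance(collector, TrajectoryCollector)
    )


def choose_seed_sketch(
    flag_on: bool,
    prm_scorer: object,
    collector: object,
    pick_frontier_seed: Callable[[], str],
    pick_seed: Callable[[], str],
) -> str:
    if frontier_path_enabled(flag_on, prm_scorer, collector):
        try:
            return pick_frontier_seed()
        except Exception:
            # Frontier weighting must never block a cycle: log and fall through.
            logger.debug("frontier-aware selection failed", exc_info=True)
    return pick_seed()
```

A plain MagicMock() fails the isinstance checks, so a mocked context takes the pick_seed path without touching the frontier code.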

Covered by tests/test_dream_frontier_weighted.py (real PRM + collector + tracker, mocked LLM/sandbox) and tests/test_dream_synthetic_curiosity.py (legacy path regression).

End-to-end walkthrough: see algorithms / dream cycle.