core / dream.py

Idle-time consolidation, synthetic challenge generation, and the full self-play orchestration pipeline.

Responsibilities

Validation stack (three gates, in order)

1. validate_challenge_quality(setup, validator) → (bool, reason)

Pattern-based quality gate (line 102). Rejects:

Rejection kinds are returned so the regeneration loop can route targeted feedback (files-mismatch → targeted "open file X" hint; data-gen → "don't call random.seed" hint; etc.). Generation is attempted up to 3 times. A files-mismatch rejection triggers the validator repair path, which regenerates only the validator with a focused prompt.
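The routing of rejection kinds to hints can be sketched roughly as below. This is a simplified illustration, not the module's actual loop: generate_with_feedback, HINTS, and the generate/check callables are hypothetical names, the hint wording is paraphrased from the text above, and the separate validator repair path is not modelled.

```python
from typing import Callable, Optional, Tuple

MAX_GENERATION_ATTEMPTS = 3  # from the text: generation is attempted up to 3 times

# Hypothetical hint table; only these two rejection kinds are named in the text.
HINTS = {
    "files-mismatch": "Open file X in the validator, the same file the setup creates.",
    "data-gen": "Don't call random.seed in the setup script.",
}


def generate_with_feedback(
    generate: Callable[[Optional[str]], object],
    check: Callable[[object], Tuple[bool, str]],
) -> Tuple[Optional[object], int]:
    """Regenerate until the quality gate passes or attempts run out.

    generate(hint) builds a challenge; check(challenge) returns (ok, kind),
    where kind is the rejection kind used to look up the next hint.
    """
    hint: Optional[str] = None
    for attempt in range(1, MAX_GENERATION_ATTEMPTS + 1):
        challenge = generate(hint)
        ok, kind = check(challenge)
        if ok:
            return challenge, attempt
        hint = HINTS.get(kind)  # unknown kinds fall back to no hint
    return None, MAX_GENERATION_ATTEMPTS
```

Keeping the hint table separate from the loop means a new rejection kind only needs a new table entry.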

2. Preflight (.preflight.py)

Runs the validator in the sandbox with __name__ == "__dry_run__" (bypasses any if __name__ == "__main__" guard). Catches module-scope NameError / ImportError / ModuleNotFoundError — unrunnable validators that used to burn a whole solver attempt.
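A minimal sketch of the dry-run idea, using the standard library's runpy.run_path to execute a script under a custom __name__. It runs in-process rather than in the sandbox, and the preflight helper and file names are hypothetical; only the "__dry_run__" module name and the three exception types come from the text above.

```python
import runpy
import tempfile
from pathlib import Path
from typing import Tuple

# ModuleNotFoundError is a subclass of ImportError, so two entries cover all three.
PREFLIGHT_ERRORS = (NameError, ImportError)


def preflight(validator_path: Path) -> Tuple[bool, str]:
    """Execute module-scope code only; the __main__ guard body is skipped."""
    try:
        runpy.run_path(str(validator_path), run_name="__dry_run__")
    except PREFLIGHT_ERRORS as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


with tempfile.TemporaryDirectory() as tmp:
    bad = Path(tmp) / "validator_bad.py"
    bad.write_text("import module_that_does_not_exist_xyz\n")

    good = Path(tmp) / "validator_good.py"
    good.write_text("if __name__ == '__main__':\n    raise SystemExit(1)\n")

    bad_result = preflight(bad)    # rejected: import fails at module scope
    good_result = preflight(good)  # accepted: the guarded body never runs
```

Because __name__ is not "__main__", only import-time and module-scope problems surface, which is the class of failure this gate targets.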

3. Validator self-test gate (.validator_selftest.py)

Catches internal-contradiction bugs like the canonical float('60.00%') trap, where a validator formats an expected field with a unit suffix and then calls float() on it. Flow:

  1. _instrument_validator_for_self_test(src): use the AST to find the first subprocess.run(...solution.py...) statement and prepend a probe that dumps the first-resolved expected_* variable between the <<<__GHOST_SELFTEST_EXPECTED_START__>>> and <<<__GHOST_SELFTEST_EXPECTED_END__>>> sentinels, then raises SystemExit(42).
  2. _extract_selftest_dump(stdout): pull the dumped block out of the probe's stdout.
  3. Write a solution.py that calls sys.stdout.write(<dumped>); restore mocks from the post-setup snapshot.
  4. Run the original validator. _looks_like_validator_crash(out) returns True iff the traceback's innermost frame is .validator.py, in which case the challenge is rejected as unwinnable.
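Steps 2 and 3 of this flow might look roughly like the sketch below. The sentinel strings are taken from the text; the function bodies are assumptions, including the choice to return the dumped block verbatim with no whitespace stripping, and make_echo_solution is a hypothetical helper name.

```python
from typing import Optional

START = "<<<__GHOST_SELFTEST_EXPECTED_START__>>>"
END = "<<<__GHOST_SELFTEST_EXPECTED_END__>>>"


def extract_selftest_dump(stdout: str) -> Optional[str]:
    """Return the text between the sentinels, or None if either is missing."""
    start = stdout.find(START)
    if start == -1:
        return None
    start += len(START)
    end = stdout.find(END, start)
    if end == -1:
        return None
    return stdout[start:end]


def make_echo_solution(dumped: str) -> str:
    """Source for a solution.py that prints exactly what the validator expects."""
    return "import sys\nsys.stdout.write(" + repr(dumped) + ")\n"


probe_stdout = "setup noise\n" + START + "60.00%\n" + END + "\n"
dumped = extract_selftest_dump(probe_stdout)
solution_src = make_echo_solution(dumped)
```

If the original validator still crashes when fed its own expected output, the fault lies in the validator rather than in any solver.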

Candidate variable names scanned (in order): expected_output, expected_lines, expected, expected_text, expected_str, expected_result, golden_output, golden, correct_output, answer. The gate is best-effort: unparseable validators or those without a matching subprocess.run skip cleanly.
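The best-effort scan can be illustrated with the standard ast module. This sketch only locates the probe target and does not inject anything. It resolves the variable statically from top-level assignments, which is an assumption, since the real probe may resolve names at runtime. find_probe_target is a hypothetical name, and ast.unparse needs Python 3.9 or later.

```python
import ast
from typing import Optional, Tuple

# Candidate names, in the order listed in the text.
CANDIDATES = (
    "expected_output", "expected_lines", "expected", "expected_text",
    "expected_str", "expected_result", "golden_output", "golden",
    "correct_output", "answer",
)


def _is_solution_run(node: ast.AST) -> bool:
    """True for a subprocess.run(...) call whose source mentions solution.py."""
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "run"
        and isinstance(node.func.value, ast.Name)
        and node.func.value.id == "subprocess"
        and "solution.py" in ast.unparse(node)
    )


def find_probe_target(src: str) -> Optional[Tuple[int, str]]:
    """Return (line of the first matching statement, variable to dump) or None."""
    try:
        tree = ast.parse(src)
    except SyntaxError:
        return None  # unparseable validators skip cleanly

    assigned = set()
    for stmt in tree.body:  # top-level statements only (assumption)
        if any(_is_solution_run(n) for n in ast.walk(stmt)):
            for name in CANDIDATES:
                if name in assigned:
                    return stmt.lineno, name
            return None
        # Record names assigned before the subprocess.run statement.
        for n in ast.walk(stmt):
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store):
                assigned.add(n.id)
    return None  # no matching subprocess.run: skip cleanly
```

Returning None on every unexpected shape keeps the gate from rejecting challenges it simply cannot analyse.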

Runtime crash detector (widened)

Even past the three gates, a validator can raise during comparison. The attempt-loop circuit breaker classifies a traceback as a validator crash when the tail frame is .validator.py, solution.py is absent from the feedback, and the exception type is one of:

Detection aborts the cycle after attempt 1 instead of burning all 3 on the same broken validator.
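A rough sketch of the three-part classification. The set of exception types is passed in by the caller because the list is not reproduced here; is_validator_crash and the sample traceback are illustrative assumptions, not the module's actual code.

```python
import re
from typing import Iterable

# Matches the 'File "path", line N' lines of a standard Python traceback.
FRAME_RE = re.compile(r'^\s*File "([^"]+)", line \d+', re.MULTILINE)


def is_validator_crash(feedback: str, crash_types: Iterable[str]) -> bool:
    """Apply the three conditions: tail frame, no solution.py, known exception."""
    if "solution.py" in feedback:
        return False  # the solver's script is implicated, not the validator alone
    frames = FRAME_RE.findall(feedback)
    if not frames or not frames[-1].endswith(".validator.py"):
        return False  # innermost frame is not the validator
    last_line = feedback.strip().splitlines()[-1]
    exc_name = last_line.split(":", 1)[0].strip()
    return exc_name in set(crash_types)


sample = (
    "Traceback (most recent call last):\n"
    '  File "/work/.validator.py", line 12, in <module>\n'
    "    rate = float(expected_rate)\n"
    "ValueError: could not convert string to float: '60.00%'\n"
)
crashed = is_validator_crash(sample, {"ValueError"})
```

A caller would typically abort the attempt loop as soon as this returns True for the first attempt's feedback.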

Lesson pipeline

_extract_structured_lesson

Meta-cognitive LLM call that returns {trigger, anti_pattern, correct_pattern, domains, confidence, task/mistake/solution (legacy mirrors)}. The prompt explicitly requires task-class triggers, forbids copying fixture literals, and mandates a non-empty taxonomy domain set.

_generalization_guard

Last line of defence against overfit lessons. Uses n-gram token overlap (_GENERALIZATION_MIN_NGRAM = 6) to reject lessons whose:

_verify_lesson_helpful

For struggled-then-won and failure cases, re-runs the solver once with the lesson prepended under the production ### SKILL PLAYBOOK: header. The lesson is kept only if the outcome strictly improves.
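The two lesson filters in this section can be sketched together. The tokenisation, the comparison against challenge fixture text, and the numeric outcome scores are all assumptions made for illustration; only the minimum n-gram length of 6 and the "strictly improves" rule come from the text.

```python
import re
from typing import List, Set, Tuple

GENERALIZATION_MIN_NGRAM = 6  # mirrors _GENERALIZATION_MIN_NGRAM


def tokens(text: str) -> List[str]:
    """Lower-cased word tokens (hypothetical tokenisation)."""
    return re.findall(r"\w+", text.lower())


def ngrams(toks: List[str], n: int) -> Set[Tuple[str, ...]]:
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def shares_long_ngram(lesson: str, fixture: str,
                      n: int = GENERALIZATION_MIN_NGRAM) -> bool:
    """True if the lesson repeats any n consecutive tokens of the fixture."""
    return bool(ngrams(tokens(lesson), n) & ngrams(tokens(fixture), n))


def keep_lesson(score_without: float, score_with: float) -> bool:
    """Keep the lesson only on strict improvement; ties are dropped."""
    return score_with > score_without


fixture = "the file sales_q3.csv has columns region, units, price and discount"
overfit = "Remember sales_q3.csv has columns region, units, price and then sum"
general = "Read the CSV header before indexing columns by name"

overfit_flag = shares_long_ngram(overfit, fixture)
general_flag = shares_long_ngram(general, fixture)
```

A long shared n-gram is a cheap signal that the lesson copied fixture literals instead of describing a task class.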

Isolation markers

The isolated sub-context sets several attributes to prevent production-state writes:

Rejection prompt contract (retry attempts)

When an attempt fails validation, the retry prompt injected for the next attempt contains the validator's feedback string (the FAIL line with expected-vs-actual output) plus optional float-formatting hints. It does not contain the .validator.py source. An earlier revision pasted the full validator script into the retry prompt so the agent could "debug the validator's logic", but this turned every struggled-then-won cycle into an answer-key lookup — the agent copied the validator's constants (multipliers, SQL query shape) instead of reasoning from the expected-vs-actual diff. Skill-gate lessons from those cycles were memorised constants, not transferable knowledge.

Retry prompts now force the solver to reason from the diff and the original task description. Some complex challenges will fail their second attempt that previously "succeeded" via copying — that is the intended behaviour: a genuine failure is better training signal than a cheated pass.
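The contract can be made hard to violate by giving the prompt builder no parameter for the validator source at all. The sketch below is an assumption about shape only: build_retry_prompt and the section headers are hypothetical, not the production prompt format.

```python
from typing import Optional


def build_retry_prompt(task: str, feedback: str,
                       float_hint: Optional[str] = None) -> str:
    """Assemble a retry prompt from the task, the FAIL feedback, and a hint.

    There is deliberately no argument for the validator script, so it cannot
    leak into the prompt by accident.
    """
    parts = [
        "### TASK",
        task.strip(),
        "### VALIDATOR FEEDBACK",
        feedback.strip(),
    ]
    if float_hint:
        parts += ["### HINT", float_hint.strip()]
    return "\n".join(parts)


retry_prompt = build_retry_prompt(
    task="Print the discount rate as a percentage.",
    feedback="FAIL: expected '60.00%' but got '0.6'",
    float_hint="Format floats with two decimal places.",
)
```

The solver sees only the expected-versus-actual diff, which is the information a real user report would contain.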

Public functions

Function | Purpose
validate_challenge_quality(setup, validator) → (bool, reason) | Pattern-based quality gate.
_instrument_validator_for_self_test(src) → Optional[str] | AST probe injector (module-level, testable).
_extract_selftest_dump(stdout) → Optional[str] | Pull dumped expected-output from sentinel markers.
_looks_like_validator_crash(text) → bool | Tail-of-traceback check for .validator.py frame.
detect_tool_patterns(skill_memory) → list | Cross-episode tool-call sequence detection.
Dreamer.synthetic_self_play(model_name, is_background) | Full pipeline: seed → source select → gates → run → score → extract → verify → persist.
Dreamer._try_journal_challenge(probability) | Probabilistic journal mining; probability bumped to 0.75 under saturation.
Dreamer._generalization_guard(lesson, …) → (bool, reason) | Overfit-lesson rejection.
Dreamer.dream(model_name) | Journal → long-term consolidation (the REM path).

Concurrency

Orchestration is async; the temporary agent loop runs synchronously inside the simulation. A cycle is triggered by the lifespan-spawned biological watchdog, by the self_play tool (one-shot), or by self_play_loop (continuous). The validator probe, self-test, and solver runs execute in the sandbox via asyncio.to_thread.
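The pattern of offloading blocking sandbox calls can be sketched with asyncio.to_thread (Python 3.9+). Here run_in_sandbox is a hypothetical stand-in that simply shells out to the current interpreter; the real sandbox interface is not shown in this document.

```python
import asyncio
import subprocess
import sys
from typing import List


def run_in_sandbox(code: str) -> str:
    """Blocking stand-in for a sandbox call (hypothetical)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.stdout


async def run_gate(code: str) -> str:
    # The blocking call runs in a worker thread, so the event loop stays free.
    return await asyncio.to_thread(run_in_sandbox, code)


async def main() -> List[str]:
    # gather preserves argument order in its result list.
    return await asyncio.gather(
        run_gate("print('preflight ok')"),
        run_gate("print('selftest ok')"),
    )


results = [line.strip() for line in asyncio.run(main())]
```

The orchestrating coroutine stays responsive to other tasks while each gate waits on its subprocess.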

Frontier-aware cluster selection

Before the source-selection stage, synthetic_self_play chooses which cluster to target. The default path calls FrontierTracker.pick_seed (brittle-pool weighted). The frontier-aware path requires all of the following: --frontier-selfplay is on (the default), ctx.prm_scorer is a real PRMScorer with has_model=True, and ctx.trajectory_collector is a real TrajectoryCollector. Both type checks are strict isinstance checks, so MagicMock-backed test contexts fail closed. When these conditions hold, the dream loop instead:

  1. Builds the candidate cluster pool as set(challenge_templates.TEMPLATES.keys()) ∪ tracker.clusters.
  2. Computes per-cluster signals via core/frontier_selection.py: compute_cluster_uncertainty (PRM boundary-distance) and compute_cluster_rarity (log-decay of Trajectory.cluster counts).
  3. Calls frontier_tracker.pick_frontier_seed(uncertainty_by_cluster=…, rarity_by_cluster=…, uniform_sample_prob=args.frontier_uniform_sample_prob).

Any exception in the frontier-aware block is logged at debug and falls through to pick_seed — frontier weighting must never block a self-play cycle. The selected seed's hint is appended to the challenge-generation prompt under ### FRONTIER SEED regardless of which picker produced it. The new path's hint begins with FRONTIER TARGET (PRM-weighted) so logs and tests can attribute the source.
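The gating and fall-through behaviour described above might be sketched as follows. PRMScorer, TrajectoryCollector, and Tracker here are hypothetical stand-ins defined only for this example, and choose_seed is an illustrative name; the keyword arguments to pick_frontier_seed and the hint prefix follow the text, while the weighting inside the stand-in tracker is invented.

```python
import logging
from types import SimpleNamespace
from unittest.mock import MagicMock

log = logging.getLogger("dream")


class PRMScorer:  # stand-in for the real class
    def __init__(self, has_model: bool) -> None:
        self.has_model = has_model


class TrajectoryCollector:  # stand-in for the real class
    pass


class Tracker:  # stand-in for FrontierTracker
    def pick_seed(self) -> str:
        return "legacy seed"

    def pick_frontier_seed(self, *, uncertainty_by_cluster, rarity_by_cluster,
                           uniform_sample_prob) -> str:
        # Invented weighting, for illustration only.
        best = max(
            uncertainty_by_cluster,
            key=lambda c: uncertainty_by_cluster[c] * rarity_by_cluster.get(c, 1.0),
        )
        return f"FRONTIER TARGET (PRM-weighted): {best}"


def choose_seed(ctx, tracker, enabled, compute_signals, uniform_sample_prob):
    eligible = (
        enabled
        and isinstance(ctx.prm_scorer, PRMScorer)  # a MagicMock fails this check
        and ctx.prm_scorer.has_model is True
        and isinstance(ctx.trajectory_collector, TrajectoryCollector)
    )
    if eligible:
        try:
            uncertainty, rarity = compute_signals()
            return tracker.pick_frontier_seed(
                uncertainty_by_cluster=uncertainty,
                rarity_by_cluster=rarity,
                uniform_sample_prob=uniform_sample_prob,
            )
        except Exception:
            # Frontier weighting must never block a cycle: log and fall through.
            log.debug("frontier-aware selection failed", exc_info=True)
    return tracker.pick_seed()


real_ctx = SimpleNamespace(prm_scorer=PRMScorer(True),
                           trajectory_collector=TrajectoryCollector())
mock_ctx = SimpleNamespace(prm_scorer=MagicMock(),
                           trajectory_collector=MagicMock())
signals = lambda: ({"csv": 0.9, "sql": 0.2}, {"csv": 1.0, "sql": 1.0})
```

With these stand-ins, real_ctx takes the frontier path, while mock_ctx or a failing compute_signals falls back to pick_seed.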

Covered by tests/test_dream_frontier_weighted.py (real PRM + collector + tracker, mocked LLM/sandbox) and tests/test_dream_synthetic_curiosity.py (legacy path regression).

End-to-end walkthrough: see algorithms / dream cycle.