Dream cycle & self-play

How idle time turns failures into curriculum and lessons into pinned skills.

Triggers

Self-play runs under three distinct triggers:

Idle — the biological watchdog detects >60 min user inactivity (foreground LLM tasks = 0) and spawns a single cycle with adaptive cooldown.
Manual one-shot — the user says "run self play"; the LLM calls the self_play tool and runs exactly one cycle.
Continuous loop — the user says "run self play in a loop"; the LLM calls self_play_loop, which spawns a background asyncio.Task that runs cycles back-to-back until either the user sends any message (which the handle_chat interrupt hook converts into a stop signal) or the LLM calls stop_self_play. The cool-off between cycles is adaptive (5–180 s) via FrontierTracker.adaptive_cooldown. The loop is not persisted across restarts.

Figure 11 — Dream cycle with full validation stack, saturation routing, and continuous-loop consolidation. Step [b] runs pick_frontier_seed (PRM-weighted) when the gating conditions hold; otherwise legacy pick_seed.

Frontier-aware cluster selection (PRM-weighted)

Before the saturation / rotation logic, the dream loop chooses which cluster to target. The legacy path uses FrontierTracker.pick_seed, which weights the brittle pool and tracks only outcomes. The picker swaps to pick_frontier_seed when all three gating conditions hold: --frontier-selfplay is on (the default), ctx.prm_scorer is a real PRMScorer with has_model=True, and ctx.trajectory_collector is a real TrajectoryCollector. Both type checks are strict isinstance checks, so MagicMock-backed test contexts fail closed. pick_frontier_seed combines two complementary signals:

PRM uncertainty via compute_cluster_uncertainty — boundary distance 1 − 2·|p − 0.5| against a synthetic representative state per cluster. High when the PRM has no opinion or is genuinely at the boundary.
Trajectory rarity via compute_cluster_rarity — 1 / (1 + log1p(count)) from Trajectory.cluster groupings. Smooth, bounded, log-decay so well-explored clusters still get weight when uncertainty is high.

Signals are multiplied; saturated clusters get weight 0. Three transparent fallbacks restore the legacy pick_seed behaviour: empty signals (cold boot), a uniform-sample sanity roll (default 20%, exposed as --frontier-uniform-sample-prob), or all-zero combined weights. Any exception in the new path is logged at debug level and falls through, because frontier weighting must never block a self-play cycle.
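The two signals and the three fallbacks can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the dictionary inputs, and the convention of returning None to mean "use the legacy picker" are all assumptions made for the example. Only the two formulas, the multiplication, the zero weight for saturated clusters, and the 20% default come from the text above.

```python
import math
import random


def boundary_uncertainty(p: float) -> float:
    """1 - 2*|p - 0.5|: 1.0 when p = 0.5, 0.0 when p = 0 or p = 1."""
    return 1.0 - 2.0 * abs(p - 0.5)


def rarity(count: int) -> float:
    """1 / (1 + log1p(count)): 1.0 for an unseen cluster, slow log decay after."""
    return 1.0 / (1.0 + math.log1p(count))


def frontier_weights(prm_probs, traj_counts, saturated):
    """Multiply the two signals per cluster; saturated clusters get weight 0."""
    weights = {}
    for cluster, p in prm_probs.items():
        if cluster in saturated:
            weights[cluster] = 0.0
        else:
            count = traj_counts.get(cluster, 0)
            weights[cluster] = boundary_uncertainty(p) * rarity(count)
    return weights


def pick_cluster(prm_probs, traj_counts, saturated, rng, uniform_prob=0.2):
    """Return a cluster name, or None to signal a fallback to the legacy picker.

    The None convention is hypothetical; it only marks where the three
    fallbacks described above would trigger.
    """
    if not prm_probs:                    # fallback 1: empty signals (cold boot)
        return None
    if rng.random() < uniform_prob:      # fallback 2: uniform-sample sanity roll
        return None
    weights = frontier_weights(prm_probs, traj_counts, saturated)
    if sum(weights.values()) <= 0.0:     # fallback 3: all-zero combined weights
        return None
    clusters = list(weights)
    return rng.choices(clusters, weights=[weights[c] for c in clusters], k=1)[0]
```

Because the signals are multiplied, a cluster needs both some PRM uncertainty and some rarity to attract weight; a cluster the PRM is certain about (p near 0 or 1) drops toward zero regardless of how rarely it has been explored.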

Why the sanity floor matters: the PRM is itself learned from trajectories the self-play loop produces. Without the uniform-sample bypass, a cold or systematically-wrong PRM could lock self-play onto a single cluster and starve other clusters of new training signal, which would then keep the PRM wrong about them — a self-reinforcing failure mode. 20% uniform sampling breaks that loop without losing the benefit of frontier targeting on the other 80%.

The frontier-weighted seed's hint always begins with FRONTIER TARGET (PRM-weighted) so the self-play log can attribute the source. Covered by tests/test_dream_frontier_weighted.py (4 cases) and tests/test_frontier_pick_frontier_seed.py (9 cases).

Frontier saturation & rotation

A cluster is saturated when its last SATURATION_WINDOW = 2 runs are all first-try passes with near-zero compression delta (<= 0.001). A saturated cluster is producing no learning signal — continuing to target it burns cycles on material the agent already aces. (Window lowered from 3 → 2 after observing a 13-cycle self-play loop that produced 0 net lessons; at 2 the cluster rotates out one clean cycle after a struggle instead of three.)

A companion recent-win decay guard in _get_brittle_clusters_scored prevents a single stale struggled-then-won run (attempts_used=2, positive delta) from anchoring a cluster in the brittle pool once the cluster has stabilised. If the most recent run in the brittleness window is a clean first-try pass with near-zero delta, older struggles in the window no longer score — the cluster is treated as recovered even if its full window still shows some historical friction.
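The saturation test and the recent-win decay guard both reduce to a check on the tail of a cluster's outcome history. The sketch below assumes a hypothetical RunOutcome record and treats "near-zero" as a bound on the magnitude of the delta; the real data model and the exact comparison inside FrontierTracker may differ.

```python
from dataclasses import dataclass

SATURATION_WINDOW = 2
NEAR_ZERO_DELTA = 0.001  # hypothetical constant name for the <= 0.001 bound


@dataclass
class RunOutcome:
    """Hypothetical per-run record; field names are illustrative."""
    passed: bool
    attempts_used: int
    compression_delta: float


def is_clean_first_try(run: RunOutcome) -> bool:
    """First-try pass with near-zero compression delta.

    Assumption: the bound applies to the magnitude of the delta.
    """
    return (
        run.passed
        and run.attempts_used == 1
        and abs(run.compression_delta) <= NEAR_ZERO_DELTA
    )


def is_saturated(recent_outcomes: list) -> bool:
    """True when the last SATURATION_WINDOW runs are all clean first-try passes."""
    window = recent_outcomes[-SATURATION_WINDOW:]
    return len(window) == SATURATION_WINDOW and all(
        is_clean_first_try(run) for run in window
    )


def recent_win_decay_applies(brittleness_window: list) -> bool:
    """True when the most recent run is clean, so older struggles stop scoring."""
    return bool(brittleness_window) and is_clean_first_try(brittleness_window[-1])
```

Under these assumptions the decay guard fires one clean run earlier than saturation does: a single clean pass already stops older struggles from scoring, while saturation needs the whole window to be clean.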

When pick_seed finds every brittle candidate saturated (and the caller isn't rolling the 35% random-exploration dice), it returns mode="exploration" with the saturated list attached. The dream loop then branches:

Journal mining probability boost: _try_journal_challenge is called with probability=0.75 instead of the default 0.25. Journal-mined challenges come from real user post-mortems — the richest source of novel, struggle-inducing material.
20/80 coin-flip (was 50/50, rebalanced toward novelty): if journal mining misses, the loop flips a weighted coin.
- 20% → pick_random_template(exclude_clusters=saturated), so the expert concurrency / algo / regex-parse templates still get airtime.
- 80% → fall through to LLM-generated challenges, with a diversity-requirement prompt injection that tells the generator to pick from {concurrency, algo, regex_parse, sql, bash} and explicitly forbids another data-analysis / CSV groupby.
The rebalance reflects that deterministic templates are now primarily regression tests rather than training signal; novel LLM-generated shapes are where real learning happens.
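The routing for a fully saturated brittle pool can be summarised as two successive rolls. The sketch below is a simplification under stated assumptions: the constant names and the route_saturated_cycle function are hypothetical, and a journal "miss" is modelled as either a failed probability roll or an empty journal, which may not match how _try_journal_challenge decides internally.

```python
import random

JOURNAL_BOOST_PROB = 0.75  # boosted from the default 0.25 (hypothetical constant name)
TEMPLATE_PROB = 0.20       # the remaining 80% goes to LLM-generated challenges


def route_saturated_cycle(roll=random.random, journal_available=True):
    """Return the challenge source for one saturated-exploration cycle.

    `roll` is any zero-argument callable returning a float in [0, 1).
    """
    # Step 1: journal mining at the boosted probability.
    if journal_available and roll() < JOURNAL_BOOST_PROB:
        return "journal"
    # Step 2: journal mining missed, so flip the 20/80 weighted coin.
    if roll() < TEMPLATE_PROB:
        return "template"      # pick_random_template(exclude_clusters=saturated)
    return "llm_generated"     # LLM generation with the diversity requirement
```

Passing the roll function in as a parameter keeps the branch logic deterministic under test, since a fixed sequence of values can stand in for the random source.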

Tier-aware template scaling

Once a cluster has been picked (frontier, saturated-rotation, or cold-start), the template fast path renders at the cluster's difficulty tier. The dream loop builds a _resolve_tier closure over FrontierTracker.get_difficulty_tier and passes it to both try_template(cluster, tier=...) and pick_random_template(tier_resolver=...). Each template then:

Multiplies its base problem size (rows / files / tokens) by _TIER_SIZE_MULTIPLIER — 1× at basic, 4× at expert.
Activates a cluster-specific hard-mode twist at advanced / expert: NA rows in data_analysis, malformed lines in regex_parse, a stopword set in python_general, distinct-k in algo, NULL columns in sql, a third log level (FATAL) in bash, a tighter variant pool in concurrency.

The twist is always described in the challenge prompt: tier escalation is a curriculum, not a trap. Before this wiring (see the incident in the 2026-04-22 self-play log: 163 SUCCESS vs 6 FAILURE, almost all first-try), the tier machinery was purely cosmetic: templates always rendered at basic, and a Qwen-sized model solved every cluster on the first try. The tier is now the mechanism that makes mastery accumulation possible: once a cluster unlocks advanced, subsequent cycles render harder fixtures that a basic-tier solution will fail on.

Validator self-test gate

LLM-generated validators sometimes crash on their own expected output — the canonical bug is formatting an expected field with a % suffix, then calling float() on it. The self-test gate catches this before the solver wastes 3 attempts on an unwinnable challenge.

AST-parse the validator; locate the first top-level statement calling subprocess.run(...solution.py...).
Inject a probe right before that statement: dump the first resolved expected_* variable (expected_output, expected_lines, expected, golden_output, answer, …) between sentinel markers, then raise SystemExit(42).
Run the instrumented probe in the sandbox; extract the dumped block.
Write a solution.py that echoes that block verbatim.
Run the original validator. A correct validator must exit 0 on its own expected output; a self-contradicting one crashes and gets rejected.

The gate is best-effort: unparseable validators, validators missing subprocess.run(solution.py), or validators that don't use any of the candidate variable names are skipped rather than blocked. False negatives are preferred over false positives.
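The first two steps of the gate (locating the solver call and injecting the probe) can be sketched with the standard ast module. This is an illustration under assumptions, not the project's code: the sentinel strings, the function names, and the choice to look only at simple top-level assignments are invented for the example, and the candidate tuple lists only the five names given above even though the source's list is longer. It requires Python 3.9+ for ast.unparse.

```python
import ast

# Only the names listed in the text; the real candidate list is longer.
CANDIDATE_NAMES = ("expected_output", "expected_lines", "expected",
                   "golden_output", "answer")


def find_solution_call_index(tree: ast.Module):
    """Index of the first top-level statement calling subprocess.run(...solution.py...)."""
    for index, stmt in enumerate(tree.body):
        for node in ast.walk(stmt):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "run"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "subprocess"
                and "solution.py" in ast.unparse(node)
            ):
                return index
    return None


def first_expected_name(tree: ast.Module, before_index: int):
    """First candidate name assigned at top level before the solver call."""
    assigned = set()
    for stmt in tree.body[:before_index]:
        if isinstance(stmt, ast.Assign):
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    assigned.add(target.id)
    for name in CANDIDATE_NAMES:
        if name in assigned:
            return name
    return None


def instrument(validator_source: str):
    """Return the probe source, or None when the gate should skip (best-effort)."""
    try:
        tree = ast.parse(validator_source)
    except SyntaxError:
        return None                      # unparseable validator: skip
    index = find_solution_call_index(tree)
    if index is None:
        return None                      # no subprocess.run(solution.py): skip
    name = first_expected_name(tree, index)
    if name is None:
        return None                      # no candidate variable: skip
    # Hypothetical sentinel markers around the dumped expected value.
    probe = ast.parse(
        f"print('<<<EXPECTED')\nprint({name})\nprint('EXPECTED>>>')\n"
        "raise SystemExit(42)\n"
    ).body
    tree.body[index:index] = probe       # inject right before the solver call
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

Every skip path returns None instead of raising, which mirrors the best-effort stance described above: a validator the gate cannot instrument is passed along rather than rejected.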

Widened runtime crash detector

Even past the self-test, a validator may raise at the comparison line (e.g. ValueError on a formatted field, KeyError on a missing column). The runtime circuit breaker now treats the following as validator crashes when the top traceback frame is .validator.py and solution.py is not mentioned:

Structural: SyntaxError, IndentationError, ImportError, ModuleNotFoundError, NameError.
Internal contradiction (widened in the validator-self-test redesign): ValueError, TypeError, KeyError, IndexError, AttributeError.

Detection aborts the cycle after attempt 1 instead of burning all 3 on the same broken validator.
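A classifier over captured stderr is one way to express this rule. The sketch below makes two assumptions that the text does not settle: "top traceback frame" is read as the innermost frame (the last File line Python prints), and the exception name is taken from the final line of the traceback. The function and set names are hypothetical.

```python
import re

STRUCTURAL = {"SyntaxError", "IndentationError", "ImportError",
              "ModuleNotFoundError", "NameError"}
INTERNAL_CONTRADICTION = {"ValueError", "TypeError", "KeyError",
                          "IndexError", "AttributeError"}

# Matches the 'File "<path>", line N' lines of a standard Python traceback.
_FRAME_RE = re.compile(r'^\s*File "([^"]+)", line \d+', re.MULTILINE)


def is_validator_crash(stderr: str) -> bool:
    """True when stderr looks like the validator itself crashed."""
    if "solution.py" in stderr:
        return False                     # the solver is implicated: not a validator bug
    frames = _FRAME_RE.findall(stderr)
    if not frames or not frames[-1].endswith(".validator.py"):
        return False                     # assumption: innermost frame is the last one listed
    last_line = stderr.strip().splitlines()[-1]
    # "pkg.SomeError: message" -> "SomeError"
    exc_name = last_line.split(":", 1)[0].strip().rsplit(".", 1)[-1]
    return exc_name in STRUCTURAL | INTERNAL_CONTRADICTION
```

An AssertionError raised by the validator is deliberately not in either set, since that is the normal way for a validator to report a wrong answer.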

Lesson extraction pipeline

After a successful cycle, the Dreamer extracts a structured lesson (trigger, anti-pattern, correct-pattern, domains, confidence, source_challenge_hash) via a meta-cognitive LLM call. Before writing to the playbook, two gates apply:

_generalization_guard — rejects overfit lessons. Uses n-gram overlap (_GENERALIZATION_MIN_NGRAM = 6) to catch:
- Triggers that restate the synthetic challenge verbatim
- correct_patterns that copy ≥6 consecutive tokens from setup_script or the validator
- Empty or off-taxonomy domains (required non-empty subset of {data_analysis, regex_parse, sql, concurrency, algo, bash, python_general})
_verify_lesson_helpful — for struggled-then-won / failure cases, re-runs the solver once with the lesson prepended under the production ### SKILL PLAYBOOK: header. Keeps only if the outcome strictly improves (original-fail → verify-pass, or original ≥ 2 attempts → verify on attempt 1).
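Both gates come down to small predicates. The sketch below shows one plausible shape for them; the function names are hypothetical and whitespace tokenisation is an assumption, since the text does not say how tokens are split. The n-gram size, the domain taxonomy, and the "strictly improves" rule are taken from the description above.

```python
_GENERALIZATION_MIN_NGRAM = 6

ALLOWED_DOMAINS = frozenset({
    "data_analysis", "regex_parse", "sql", "concurrency",
    "algo", "bash", "python_general",
})


def _ngrams(text: str, n: int) -> set:
    """All runs of n consecutive whitespace-separated tokens (assumed tokeniser)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copies_ngram(candidate: str, source: str, n: int = _GENERALIZATION_MIN_NGRAM) -> bool:
    """True when candidate shares at least one n-token run with source."""
    return not _ngrams(candidate, n).isdisjoint(_ngrams(source, n))


def domains_ok(domains) -> bool:
    """Domains must be a non-empty subset of the taxonomy."""
    return bool(domains) and set(domains) <= ALLOWED_DOMAINS


def strictly_improves(orig_passed: bool, orig_attempts: int,
                      verify_passed: bool, verify_attempts: int) -> bool:
    """Keep the lesson only if the verification run beats the original."""
    if not verify_passed:
        return False
    if not orig_passed:
        return True                      # original-fail -> verify-pass
    return orig_attempts >= 2 and verify_attempts == 1
```

A lesson whose correct_pattern yields copies_ngram(...) == True against setup_script or the validator would be rejected as overfit, and a lesson that passes the guard still has to satisfy strictly_improves on the single verification run.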

Isolation boundary

The temp sub-agent runs on an isolated copy of the real context. All memory wrappers are read-only:

ReadOnlyVectorMemory — add() / smart_update() / delete() are no-ops.
ReadOnlySkillMemory — carries is_read_only = True as a class marker so sub-agents skip expensive write-oriented paths entirely.
ReadOnlyGraphMemory, journal = None, profile_memory = None, frontier_tracker = None.
Self-play loop handles (selfplay_loop_task, selfplay_loop_stop) are stripped from the isolated context — otherwise the inner sub-agent's own handle_chat call would trigger the user-message interrupt hook and kill the outer loop after its first cycle.
Perfect-It protocol — skipped when getattr(ctx.skill_memory, "is_read_only", False) is True. During self-play this avoids a ~15 s follow-up LLM call that would burn budget producing an optimisation suggestion destined for a no-op wrapper.
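The read-only wrapper pattern and the marker check can be illustrated with a small stand-in. Everything below is hypothetical (InMemoryStore, ReadOnlyStore, should_run_perfect_it); the real ReadOnlyVectorMemory and ReadOnlySkillMemory classes are not shown in this section, so this only mirrors the described behaviour: write methods are no-ops, reads pass through, and a class-level is_read_only flag lets callers skip write-oriented work.

```python
from types import SimpleNamespace


class InMemoryStore:
    """Hypothetical stand-in for a real memory backend."""

    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

    def search(self, query):
        return [item for item in self.items if query in item]


class ReadOnlyStore:
    """Wraps a store; writes are silently dropped, reads are delegated."""

    is_read_only = True  # class marker checked by callers

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Called only for attributes not defined on the wrapper, i.e. reads.
        return getattr(self._inner, name)

    def add(self, *args, **kwargs):
        return None

    def smart_update(self, *args, **kwargs):
        return None

    def delete(self, *args, **kwargs):
        return None


def should_run_perfect_it(ctx) -> bool:
    """Skip the follow-up call when skill memory is a read-only wrapper."""
    return getattr(ctx.skill_memory, "is_read_only", False) is not True
```

Checking the marker with getattr and a False default means an ordinary memory object, which never defines is_read_only, takes the normal write path without any changes.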

Continuous-loop consolidation

In self_play_loop, the inter-cycle boundary is the predictable window in which to drain the short-term journal via context.agent.process_journal_queue(). Without this explicit drain, the biological watchdog's 60 s tick may or may not land between long-running cycles, so dozens of buffered items can pile up in the hippocampus. The helper is a cheap no-op when the journal is empty and catches its own errors, so consolidation hiccups never kill the loop.
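The loop shape described here (cycle, drain, adaptive cool-off, stop on signal) can be sketched with asyncio primitives. The parameter names and the use of an asyncio.Event as the stop signal are assumptions for illustration; run_cycle, drain_journal, and cooldown_seconds stand in for the real cycle, process_journal_queue, and FrontierTracker.adaptive_cooldown.

```python
import asyncio


async def self_play_loop_sketch(run_cycle, drain_journal, cooldown_seconds, stop_event):
    """Run cycles back-to-back until stop_event is set; return the cycle count."""
    cycles = 0
    while not stop_event.is_set():
        await run_cycle()
        cycles += 1
        # Inter-cycle boundary: the predictable window to drain the journal.
        try:
            await drain_journal()
        except Exception:
            # Extra guard for the sketch; the real helper is described as
            # catching its own errors.
            pass
        # Cool off, but wake immediately if a stop is requested meanwhile.
        try:
            await asyncio.wait_for(stop_event.wait(), timeout=cooldown_seconds())
        except asyncio.TimeoutError:
            pass
    return cycles
```

Waiting on the stop event with a timeout, rather than calling asyncio.sleep, lets a stop request interrupt the cool-off instead of waiting out the full interval.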

Important guards (reference)

validate_challenge_quality — rejects challenges whose validator depends on randomness, references no setup files, or has SQL schema / column-count mismatches. Returns rejection kind so the regeneration loop can give targeted feedback.
correctness_weighted_score + three-tier count_tool_errors — wrong-but-compressed answers get zero credit from the score, and the three-tier error counter does not mistake fixture reads for errors.
FrontierTracker keeps a recent_hashes dedup list per cluster (window 20). Duplicates don't bump mastery counters (runs, total_first_try_wins, best_length) but DO append to recent_outcomes, so saturation detection and the decay guard can see pattern changes on deterministic-template re-rolls. An earlier implementation discarded the append, which fossilised clusters in the brittle pool forever.
pick_random_template maintains a module-level _LAST_TEMPLATE_KEY anchor so the sampler doesn't pick the same template twice in a row.
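The FrontierTracker dedup rule above (duplicates skip the mastery counters but still land in recent_outcomes) can be sketched as below. ClusterStats and record_run are hypothetical names, best_length is omitted for brevity, and outcomes are reduced to a boolean; only the window of 20 and the counter-versus-outcome split come from the text.

```python
from collections import deque
from dataclasses import dataclass, field

DEDUP_WINDOW = 20


@dataclass
class ClusterStats:
    """Hypothetical per-cluster record; best_length is left out of the sketch."""
    runs: int = 0
    total_first_try_wins: int = 0
    recent_hashes: deque = field(default_factory=lambda: deque(maxlen=DEDUP_WINDOW))
    recent_outcomes: list = field(default_factory=list)


def record_run(stats: ClusterStats, challenge_hash: str, first_try_win: bool) -> bool:
    """Record one run; return True if it counted toward mastery."""
    is_duplicate = challenge_hash in stats.recent_hashes
    # Always append, so saturation detection sees re-rolls of the same template.
    stats.recent_outcomes.append(first_try_win)
    if is_duplicate:
        return False                     # duplicates never bump mastery counters
    stats.recent_hashes.append(challenge_hash)
    stats.runs += 1
    if first_try_win:
        stats.total_first_try_wins += 1
    return True
```

The bounded deque drops the oldest hash once 20 are stored, so a challenge seen long ago can count toward mastery again.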