Dream cycle & self-play
How idle time turns failures into curriculum and lessons into pinned skills.
Triggers
Self-play runs under three distinct triggers:
- Idle: the biological watchdog detects >60 min user inactivity (foreground LLM tasks = 0) and spawns a single cycle with adaptive cooldown.
- Manual one-shot: the user says "run self play"; the LLM calls the self_play tool and runs exactly one cycle.
- Continuous loop: the user says "run self play in a loop"; the LLM calls self_play_loop, which spawns a background asyncio.Task cycling back-to-back until the user sends any message (which the handle_chat interrupt hook converts into a stop signal) or the LLM calls stop_self_play. Cool-off between cycles is adaptive (5–180 s) via FrontierTracker.adaptive_cooldown. Not persisted across restarts.
Figure 11 — Dream cycle with full validation stack, saturation routing, and continuous-loop consolidation. Step [b] runs pick_frontier_seed (PRM-weighted) when the gating conditions hold; otherwise legacy pick_seed.
Frontier-aware cluster selection (PRM-weighted)
Before the saturation / rotation logic, the dream loop chooses which cluster to target. The default path uses FrontierTracker.pick_seed — brittle-pool weighted, tracking only outcomes. When --frontier-selfplay is on (default), and ctx.prm_scorer is a real PRMScorer with has_model=True, and ctx.trajectory_collector is a real TrajectoryCollector (strict isinstance checks — MagicMock-backed test contexts fail closed), the picker swaps to pick_frontier_seed which combines two complementary signals:
- PRM uncertainty via compute_cluster_uncertainty: boundary distance 1 − 2·|p − 0.5| against a synthetic representative state per cluster. High when the PRM has no opinion or is genuinely at the boundary.
- Trajectory rarity via compute_cluster_rarity: 1 / (1 + log1p(count)) from Trajectory.cluster groupings. Smooth, bounded, log-decay so well-explored clusters still get weight when uncertainty is high.
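The two formulas above can be written out directly. The function names below are illustrative stand-ins rather than the real compute_cluster_uncertainty / compute_cluster_rarity, which also handle the per-cluster state and trajectory grouping that this sketch leaves out.

```python
import math


def boundary_uncertainty(p: float) -> float:
    """1 - 2*|p - 0.5|: equals 1.0 at p = 0.5 and 0.0 at p = 0 or p = 1."""
    return 1.0 - 2.0 * abs(p - 0.5)


def rarity(count: int) -> float:
    """1 / (1 + log1p(count)): equals 1.0 for an unseen cluster, then decays slowly."""
    return 1.0 / (1.0 + math.log1p(count))
```

Because the decay is logarithmic, a cluster with hundreds of trajectories still keeps a small positive rarity, so a high uncertainty value can pull it back into rotation.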
Signals are multiplied; saturated clusters get weight 0. Three transparent fallbacks restore the legacy pick_seed behaviour: empty signals (cold boot), a uniform-sample sanity roll (default 20%, exposed as --frontier-uniform-sample-prob), or all-zero combined weights. Any exception in the new path is logged at debug and falls through — frontier weighting must never block a self-play cycle.
Why the sanity floor matters: the PRM is itself learned from trajectories the self-play loop produces. Without the uniform-sample bypass, a cold or systematically-wrong PRM could lock self-play onto a single cluster and starve other clusters of new training signal, which would then keep the PRM wrong about them — a self-reinforcing failure mode. 20% uniform sampling breaks that loop without losing the benefit of frontier targeting on the other 80%.
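A minimal sketch of the weighted pick with its three fallbacks, assuming the per-cluster signals arrive as plain dicts. The function name, its signature, and the convention of returning None to mean "use the legacy picker" are assumptions made for illustration.

```python
import random


def pick_frontier_cluster(uncertainty, rarity, saturated, uniform_prob=0.2, rng=random):
    """Return a cluster name, or None to signal a fallback to the legacy picker."""
    if not uncertainty or not rarity:
        return None  # fallback 1: empty signals (cold boot)
    if rng.random() < uniform_prob:
        return None  # fallback 2: uniform-sample sanity roll
    clusters = [c for c in uncertainty if c in rarity]
    # Signals are multiplied; saturated clusters get weight 0.
    weights = [0.0 if c in saturated else uncertainty[c] * rarity[c] for c in clusters]
    if sum(weights) <= 0.0:
        return None  # fallback 3: all-zero combined weights
    return rng.choices(clusters, weights=weights, k=1)[0]
```

A caller would wrap this in a try/except that logs at debug level and falls through to the legacy path, so an error in the weighting never blocks a cycle.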
The frontier-weighted seed's hint always begins with FRONTIER TARGET (PRM-weighted) so the self-play log can attribute the source. Covered by tests/test_dream_frontier_weighted.py (4 cases) and tests/test_frontier_pick_frontier_seed.py (9 cases).
Frontier saturation & rotation
A cluster is saturated when its last SATURATION_WINDOW = 2 runs are all first-try passes with near-zero compression delta (<= 0.001). A saturated cluster is producing no learning signal — continuing to target it burns cycles on material the agent already aces. (Window lowered from 3 → 2 after observing a 13-cycle self-play loop that produced 0 net lessons; at 2 the cluster rotates out one clean cycle after a struggle instead of three.)
A companion recent-win decay guard in _get_brittle_clusters_scored prevents a single stale struggled-then-won run (attempts_used=2, positive delta) from anchoring a cluster in the brittle pool once the cluster has stabilised. If the most recent run in the brittleness window is a clean first-try pass with near-zero delta, older struggles in the window no longer score — the cluster is treated as recovered even if its full window still shows some historical friction.
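The saturation test and the decay guard can be sketched together. The Outcome record, the epsilon constant name, the use of an absolute value for "near-zero", and the struggle-counting score are all assumptions; only SATURATION_WINDOW = 2 and the 0.001 threshold come from the text.

```python
from dataclasses import dataclass

SATURATION_WINDOW = 2         # value stated in the text
SATURATION_DELTA_EPS = 0.001  # threshold from the text; the constant name is hypothetical


@dataclass
class Outcome:
    success: bool
    attempts_used: int
    compression_delta: float


def is_clean_pass(o: Outcome) -> bool:
    """First-try pass with near-zero compression delta (abs() is an assumption)."""
    return o.success and o.attempts_used == 1 and abs(o.compression_delta) <= SATURATION_DELTA_EPS


def is_saturated(recent: list) -> bool:
    """True when the last SATURATION_WINDOW runs are all clean passes."""
    if len(recent) < SATURATION_WINDOW:
        return False
    return all(is_clean_pass(o) for o in recent[-SATURATION_WINDOW:])


def brittleness_score(window: list) -> float:
    """Hypothetical score: number of non-clean runs, zeroed by the decay guard."""
    if not window:
        return 0.0
    if is_clean_pass(window[-1]):
        return 0.0  # recent-win decay guard: older struggles no longer score
    return float(sum(1 for o in window if not is_clean_pass(o)))
```

With a struggled-then-won run followed by one clean pass, the guard already drops the score to zero, and a second clean pass marks the cluster saturated.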
When pick_seed finds every brittle candidate saturated (and the caller isn't rolling the 35% random-exploration dice), it returns mode="exploration" with the saturated list attached. The dream loop then branches:
- Journal mining probability boost: _try_journal_challenge is called with probability=0.75 instead of the default 0.25. Journal-mined challenges come from real user post-mortems, the richest source of novel, struggle-inducing material.
- 20/80 coin-flip (was 50/50, rebalanced toward novelty): if journal mining misses, the loop flips a coin. 20% → pick_random_template(exclude_clusters=saturated) so the expert concurrency / algo / regex-parse templates still get airtime. 80% → fall through to LLM-generated challenges, with a diversity-requirement prompt injection telling the generator to pick from {concurrency, algo, regex_parse, sql, bash} and explicitly forbidding another data-analysis / CSV groupby. The rebalance reflects that deterministic templates are now primarily regression tests rather than training signal; novel LLM-generated shapes are where real learning happens.
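The saturated-branch routing above can be sketched as a small function. The three callables are hypothetical stand-ins for the journal miner, the template sampler and the LLM generator; the hint wording and the fall-through when the template sampler returns nothing are also assumptions.

```python
import random

JOURNAL_PROBABILITY_SATURATED = 0.75  # boosted from the default 0.25
TEMPLATE_SHARE = 0.20                 # 20/80 split from the text

# Illustrative wording only; the real prompt injection is not shown in the source.
DIVERSITY_HINT = (
    "Pick from {concurrency, algo, regex_parse, sql, bash}; "
    "do not produce another data-analysis / CSV groupby challenge."
)


def route_saturated_cycle(saturated, try_journal, pick_template, generate_llm, rng=random):
    """Return (source, challenge) for a cycle whose brittle candidates are all saturated."""
    challenge = try_journal(probability=JOURNAL_PROBABILITY_SATURATED)
    if challenge is not None:
        return "journal", challenge
    if rng.random() < TEMPLATE_SHARE:
        challenge = pick_template(exclude_clusters=saturated)
        if challenge is not None:
            return "template", challenge
    # 80% branch, also used here when no template is available (an assumption).
    return "llm", generate_llm(diversity_hint=DIVERSITY_HINT)
```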
Tier-aware template scaling
Once a cluster has been picked (frontier, saturated-rotation, or cold-start), the template fast path renders at the cluster's difficulty tier. The dream loop builds a _resolve_tier closure over FrontierTracker.get_difficulty_tier and passes it to both try_template(cluster, tier=...) and pick_random_template(tier_resolver=...). Each template then:
- Multiplies its base problem size (rows / files / tokens) by _TIER_SIZE_MULTIPLIER: 1× at basic, 4× at expert.
- Activates a cluster-specific hard-mode twist at advanced/expert: NA rows in data_analysis, malformed lines in regex_parse, a stopword set in python_general, distinct-k in algo, NULL columns in sql, a third log level (FATAL) in bash, a tighter variant pool in concurrency.
The twist is always described in the challenge prompt — tier escalation is a curriculum, not a trap. Before this wiring (see incident in the 2026-04-22 self-play log: 163 SUCCESS vs 6 FAILURE, almost all first-try), the tier machinery was purely cosmetic: templates always rendered at basic and a Qwen-sized model 1-shot every cluster. The tier is now the mechanism that makes mastery accumulation possible — once a cluster unlocks advanced, subsequent cycles render harder fixtures that a basic-tier solution will fail on.
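A minimal sketch of tier-aware rendering, assuming the tier resolver is a plain callable from cluster name to tier name. The multiplier for advanced is a guess (the text only gives 1× at basic and 4× at expert), and render_spec and the returned dict layout are hypothetical.

```python
# 1x at basic and 4x at expert come from the text; the advanced value is assumed.
_TIER_SIZE_MULTIPLIER = {"basic": 1, "advanced": 2, "expert": 4}
HARD_MODE_TIERS = {"advanced", "expert"}

# Twists as listed in the text, keyed by cluster.
TWISTS = {
    "data_analysis": "NA rows",
    "regex_parse": "malformed lines",
    "python_general": "a stopword set",
    "algo": "distinct-k",
    "sql": "NULL columns",
    "bash": "a third log level (FATAL)",
    "concurrency": "a tighter variant pool",
}


def render_spec(cluster: str, base_size: int, tier_resolver) -> dict:
    """Scale the fixture size and attach the hard-mode twist for the cluster's tier."""
    tier = tier_resolver(cluster)
    size = base_size * _TIER_SIZE_MULTIPLIER.get(tier, 1)
    twist = TWISTS.get(cluster) if tier in HARD_MODE_TIERS else None
    # The twist is always announced in the prompt: a curriculum, not a trap.
    note = f"Hard mode: this task involves {twist}." if twist else ""
    return {"cluster": cluster, "tier": tier, "size": size, "twist": twist, "prompt_note": note}
```

For example, a sql template with a base of 100 rows renders 400 rows with NULL columns at expert, and 100 rows with no twist at basic.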
Validator self-test gate
LLM-generated validators sometimes crash on their own expected output — the canonical bug is formatting an expected field with a % suffix, then calling float() on it. The self-test gate catches this before the solver wastes 3 attempts on an unwinnable challenge.
- AST-parse the validator; locate the first top-level statement calling subprocess.run(...solution.py...).
- Inject a probe right before that statement: dump the first resolved expected_* variable (expected_output, expected_lines, expected, golden_output, answer, …) between sentinel markers, then raise SystemExit(42).
- Run the instrumented probe in the sandbox; extract the dumped block.
- Write a solution.py that echoes that block verbatim.
- Run the original validator. A correct validator must exit 0 on its own expected output; a self-contradicting one crashes and gets rejected.
The gate is best-effort: unparseable validators, validators missing subprocess.run(solution.py), or validators that don't use any of the candidate variable names are skipped rather than blocked. False negatives are preferred over false positives.
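The first two steps (locate the call, inject the probe) can be sketched with the standard ast module. This is an illustration under assumptions: the sentinel strings, the helper names and the print-based dump are invented here, and the candidate tuple contains only the names the text lists explicitly. Returning None mirrors the best-effort "skip rather than block" behaviour.

```python
import ast

# Only the names listed in the text; the real candidate list is longer.
CANDIDATES = ("expected_output", "expected_lines", "expected", "golden_output", "answer")

# Hypothetical sentinels; the probe dumps the first candidate that is defined.
PROBE_TEMPLATE = '''
for _name in {names!r}:
    if _name in globals():
        print("<<PROBE_BEGIN>>")
        print(globals()[_name])
        print("<<PROBE_END>>")
        break
raise SystemExit(42)
'''


def find_solution_call_index(tree: ast.Module):
    """Index of the first top-level statement calling subprocess.run on solution.py."""
    for index, stmt in enumerate(tree.body):
        for node in ast.walk(stmt):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "run"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "subprocess"
                and "solution.py" in ast.unparse(node)
            ):
                return index
    return None


def instrument_validator(source: str):
    """Return validator source with the probe injected, or None to skip the gate."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None  # unparseable validator: skipped, not blocked
    index = find_solution_call_index(tree)
    if index is None:
        return None  # no subprocess.run(...solution.py...) at top level
    probe = ast.parse(PROBE_TEMPLATE.format(names=CANDIDATES)).body
    tree.body[index:index] = probe  # insert right before the solver call
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

The remaining steps would run the returned source in the sandbox, cut the text between the sentinels, write it into an echoing solution.py, and then run the untouched validator against it.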
Widened runtime crash detector
Even past the self-test, a validator may raise at the comparison line (e.g. ValueError on a formatted field, KeyError on a missing column). The runtime circuit breaker now treats the following as validator crashes when the top traceback frame is .validator.py and solution.py is not mentioned:
- Structural: SyntaxError, IndentationError, ImportError, ModuleNotFoundError, NameError.
- Internal contradiction (widened in the validator-self-test redesign): ValueError, TypeError, KeyError, IndexError, AttributeError.
Detection aborts the cycle after attempt 1 instead of burning all 3 on the same broken validator.
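A sketch of the classification, assuming the detector works on the captured stderr text of a standard Python traceback. Reading "top traceback frame" as the last File line (the frame that raised) is an interpretation, and is_validator_crash is a hypothetical name.

```python
import re

STRUCTURAL = {"SyntaxError", "IndentationError", "ImportError", "ModuleNotFoundError", "NameError"}
CONTRADICTION = {"ValueError", "TypeError", "KeyError", "IndexError", "AttributeError"}

_FRAME_RE = re.compile(r'File "([^"]+)", line \d+')


def is_validator_crash(stderr: str) -> bool:
    """True when the traceback points at the validator itself, not the solution."""
    if "solution.py" in stderr:
        return False  # the solution is implicated, so let the solver retry
    frames = _FRAME_RE.findall(stderr)
    # Assumption: the deciding frame is the last one listed (where the error was raised).
    if not frames or not frames[-1].endswith(".validator.py"):
        return False
    lines = [line for line in stderr.strip().splitlines() if line.strip()]
    # The final traceback line looks like "ValueError: message".
    exc_name = lines[-1].split(":", 1)[0].strip().rsplit(".", 1)[-1]
    return exc_name in STRUCTURAL | CONTRADICTION
```

An AssertionError from the validator is deliberately not in either set: that is the normal way for a validator to report a wrong answer.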
Lesson extraction pipeline
After a successful cycle, the Dreamer extracts a structured lesson (trigger, anti-pattern, correct-pattern, domains, confidence, source_challenge_hash) via a meta-cognitive LLM call. Before writing to the playbook, two gates apply:
- _generalization_guard: rejects overfit lessons. Uses n-gram overlap (_GENERALIZATION_MIN_NGRAM = 6) to catch:
  - Triggers that restate the synthetic challenge verbatim
  - correct_patterns that copy ≥6 consecutive tokens from setup_script or the validator
  - Empty or off-taxonomy domains (required non-empty subset of {data_analysis, regex_parse, sql, concurrency, algo, bash, python_general})
- _verify_lesson_helpful: for struggled-then-won / failure cases, re-runs the solver once with the lesson prepended under the production ### SKILL PLAYBOOK: header. Keeps only if the outcome strictly improves (original-fail → verify-pass, or original ≥ 2 attempts → verify on attempt 1).
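Both gates reduce to small predicates, sketched below. Whitespace tokenisation, the dict-shaped lesson and the function names are assumptions; the n-gram length, the taxonomy and the strict-improvement rule come from the text.

```python
_GENERALIZATION_MIN_NGRAM = 6
TAXONOMY = {"data_analysis", "regex_parse", "sql", "concurrency", "algo", "bash", "python_general"}


def _ngrams(text: str, n: int) -> set:
    """All runs of n consecutive whitespace-separated tokens (tokenisation is assumed)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shares_ngram(a: str, b: str, n: int = _GENERALIZATION_MIN_NGRAM) -> bool:
    return bool(_ngrams(a, n) & _ngrams(b, n))


def passes_generalization_guard(lesson: dict, challenge: str, setup_script: str, validator: str) -> bool:
    """Reject lessons that copy the challenge or use empty / off-taxonomy domains."""
    domains = set(lesson.get("domains") or [])
    if not domains or not domains <= TAXONOMY:
        return False
    if shares_ngram(lesson["trigger"], challenge):
        return False
    if any(shares_ngram(lesson["correct_pattern"], src) for src in (setup_script, validator)):
        return False
    return True


def strictly_improves(orig_passed: bool, orig_attempts: int, verify_passed: bool, verify_attempts: int) -> bool:
    """original-fail -> verify-pass, or original >= 2 attempts -> verify on attempt 1."""
    if not verify_passed:
        return False
    if not orig_passed:
        return True
    return orig_attempts >= 2 and verify_attempts == 1
```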
Isolation boundary
The temp sub-agent runs on an isolated copy of the real context. All memory wrappers are read-only:
- ReadOnlyVectorMemory: add() / smart_update() / delete() are no-ops.
- ReadOnlySkillMemory: carries is_read_only = True as a class marker so sub-agents skip expensive write-oriented paths entirely.
- ReadOnlyGraphMemory, journal = None, profile_memory = None, frontier_tracker = None.
- Self-play loop handles (selfplay_loop_task, selfplay_loop_stop) are stripped from the isolated context; otherwise the inner sub-agent's own handle_chat call would trigger the user-message interrupt hook and kill the outer loop after its first cycle.
- Perfect-It protocol: skipped when getattr(ctx.skill_memory, "is_read_only", False) is True, as it is during self-play, so the ~15 s follow-up LLM call doesn't burn budget producing an optimisation suggestion that would land in a no-op wrapper.
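The isolation boundary can be sketched with thin wrappers and a shallow context copy. The wrapper class names, the write-method names and the is_read_only check come from the text; the search() read method, the vector_memory attribute name, isolate_context and the SimpleNamespace context are assumptions for illustration.

```python
import copy
from types import SimpleNamespace


class ReadOnlyVectorMemory:
    """Delegates reads to the real store and swallows writes."""

    def __init__(self, inner):
        self._inner = inner

    def search(self, query):  # hypothetical read method
        return self._inner.search(query)

    def add(self, *args, **kwargs):
        return None  # no-op

    def smart_update(self, *args, **kwargs):
        return None  # no-op

    def delete(self, *args, **kwargs):
        return None  # no-op


class ReadOnlySkillMemory:
    is_read_only = True  # class marker checked by write-oriented paths

    def __init__(self, inner):
        self._inner = inner


def isolate_context(ctx):
    """Shallow-copy the context, swap in read-only wrappers, strip loop handles."""
    iso = copy.copy(ctx)
    iso.vector_memory = ReadOnlyVectorMemory(ctx.vector_memory)
    iso.skill_memory = ReadOnlySkillMemory(ctx.skill_memory)
    iso.journal = None
    iso.profile_memory = None
    iso.frontier_tracker = None
    for handle in ("selfplay_loop_task", "selfplay_loop_stop"):
        if hasattr(iso, handle):
            delattr(iso, handle)  # the inner handle_chat must not see the outer loop
    return iso


def should_run_perfect_it(ctx) -> bool:
    """Skip the follow-up call when the skill memory is a read-only wrapper."""
    return getattr(ctx.skill_memory, "is_read_only", False) is not True


class _ListMemory:
    """Tiny in-memory store used only for the demo below."""

    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

    def search(self, query):
        return [item for item in self.items if query in item]


demo_ctx = SimpleNamespace(
    vector_memory=_ListMemory(), skill_memory=_ListMemory(), journal=object(),
    profile_memory=object(), frontier_tracker=object(),
    selfplay_loop_task="task-handle", selfplay_loop_stop="stop-handle",
)
```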
Continuous-loop consolidation
In self_play_loop, the inter-cycle boundary is the predictable window to drain the short-term journal via context.agent.process_journal_queue(). Without this explicit drain, the biological watchdog's 60 s tick may or may not land between long-running cycles, letting dozens of buffered items pile up on the hippocampus. The helper is a cheap no-op when the journal is empty and catches its own errors, so consolidation hiccups never kill the loop.
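A minimal sketch of draining at the cycle boundary, assuming a synchronous process_journal_queue() that returns the number of items consolidated. The real helper may be asynchronous and already catches its own errors; the extra guard, SketchAgent and run_cycles_with_drain are illustrative only.

```python
import logging

log = logging.getLogger("selfplay.sketch")


class SketchAgent:
    """Stand-in agent with a short-term journal queue (hypothetical)."""

    def __init__(self):
        self.journal_queue = []
        self.consolidated = []

    def process_journal_queue(self) -> int:
        if not self.journal_queue:
            return 0  # cheap no-op when the journal is empty
        drained = len(self.journal_queue)
        self.consolidated.extend(self.journal_queue)
        self.journal_queue.clear()
        return drained


def safe_drain(agent) -> int:
    """Drain the journal; a consolidation error must never kill the loop."""
    try:
        return agent.process_journal_queue()
    except Exception:
        log.debug("journal drain failed", exc_info=True)
        return 0


def run_cycles_with_drain(agent, cycles) -> list:
    """Run each cycle, then drain at the inter-cycle boundary."""
    drained_per_cycle = []
    for cycle in cycles:
        cycle(agent)
        drained_per_cycle.append(safe_drain(agent))
    return drained_per_cycle
```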
Important guards (reference)
- validate_challenge_quality: rejects challenges whose validator depends on randomness, references no setup files, or has SQL schema / column-count mismatches. Returns rejection kind so the regeneration loop can give targeted feedback.
- correctness_weighted_score + three-tier count_tool_errors: wrong-but-compressed answers get zero credit; fixture reads aren't mistaken for errors.
- FrontierTracker keeps a recent_hashes dedup list per cluster (window 20). Duplicates don't bump mastery counters (runs, total_first_try_wins, best_length) but DO append to recent_outcomes so saturation detection and the decay guard can see pattern changes on deterministic-template re-rolls; an earlier implementation discarded the append and fossilised clusters in the brittle pool forever.
- pick_random_template maintains a module-level _LAST_TEMPLATE_KEY anchor so the sampler doesn't pick the same template twice in a row.
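The dedup rule for FrontierTracker can be sketched as follows. ClusterStats and record_run are hypothetical names, best_length is omitted, and outcomes are reduced to a boolean; the point is only the ordering: the outcome is appended before the duplicate check, while the counters are updated after it.

```python
from collections import deque
from dataclasses import dataclass, field

RECENT_HASH_WINDOW = 20  # dedup window stated in the text


@dataclass
class ClusterStats:
    runs: int = 0
    total_first_try_wins: int = 0
    recent_hashes: deque = field(default_factory=lambda: deque(maxlen=RECENT_HASH_WINDOW))
    recent_outcomes: list = field(default_factory=list)


def record_run(stats: ClusterStats, challenge_hash: str, first_try_win: bool) -> bool:
    """Record a run; return False when it was a duplicate challenge."""
    # Always append the outcome so saturation detection sees re-rolls too.
    stats.recent_outcomes.append(first_try_win)
    if challenge_hash in stats.recent_hashes:
        return False  # duplicate: mastery counters stay untouched
    stats.recent_hashes.append(challenge_hash)
    stats.runs += 1
    stats.total_first_try_wins += int(first_try_win)
    return True
```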