memory/frontier.py: FrontierTracker
Per-cluster self-play telemetry. Detects mastery, ranks brittleness, scaffolds difficulty tiers, and rotates away from saturated clusters.
Storage
- JSON at memory_dir/self_play_frontier.json.
- Cross-process advisory lock at memory_dir/self_play_frontier.json.lock via fcntl; falls back to threading-only locking on non-POSIX platforms.
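A minimal sketch of this locking scheme follows. It assumes fcntl.flock on the sidecar .lock file plus a module-level threading.Lock. The helper names (frontier_lock, load_state, save_state) are hypothetical. The temp-file-plus-os.replace write is an added assumption, not documented behaviour of the module.

```python
import json
import os
import threading
from contextlib import contextmanager

try:
    import fcntl  # POSIX only
except ImportError:  # e.g. Windows: no cross-process lock available
    fcntl = None

STATE_NAME = "self_play_frontier.json"
_THREAD_LOCK = threading.Lock()  # serialises threads within one process


@contextmanager
def frontier_lock(memory_dir: str):
    """Hold the in-process lock and, on POSIX, an exclusive advisory file lock."""
    with _THREAD_LOCK:
        if fcntl is None:
            yield  # threading-only fallback
            return
        lock_path = os.path.join(memory_dir, STATE_NAME + ".lock")
        with open(lock_path, "a") as fh:
            fcntl.flock(fh.fileno(), fcntl.LOCK_EX)  # blocks until acquired
            try:
                yield
            finally:
                fcntl.flock(fh.fileno(), fcntl.LOCK_UN)


def load_state(memory_dir: str) -> dict:
    path = os.path.join(memory_dir, STATE_NAME)
    with frontier_lock(memory_dir):
        try:
            with open(path) as fh:
                return json.load(fh)
        except FileNotFoundError:
            return {"runs": [], "clusters": {}}


def save_state(memory_dir: str, state: dict) -> None:
    path = os.path.join(memory_dir, STATE_NAME)
    with frontier_lock(memory_dir):
        tmp = path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp, path)  # atomic rename: readers never see a half-written file
```

The lock is advisory: it only coordinates processes that also take it before touching the JSON file.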
Schema
{
"runs": [{ "timestamp", "cluster_key", "challenge", "passed", "attempts_used",
"length", "delta", "mistake" }],
"clusters": {
"<cluster_key>": {
"runs": int,
"best_length": int,
"last_length": int,
"last_compression": float,
"mastered": bool,
"recent_outcomes": [last 10 runs],
"recent_hashes": [last 20 challenge SHA1s],
"total_first_try_wins": int,
"unlocked_tier_index": int // monotonic
}
}
}
Constants
Dedup behaviour on duplicate challenges
Deterministic templates (shop.db GROUP BY, data.csv aggregation, logs/ ERROR+WARN counting, etc.) produce byte-identical challenge text on every re-roll, so every re-roll hits recent_hashes as a duplicate. The dedup path splits two concerns:
- Protect mastery counters: runs, total_first_try_wins, best_length, and unlocked_tier_index do NOT advance on a duplicate. Without this guard a deterministic template would ratchet any cluster to "mastered" in a handful of re-rolls.
- Preserve saturation signal: the outcome IS still appended to recent_outcomes (tagged duplicate=True, delta=0.0) and state IS persisted. This is what lets _cluster_is_saturated and the brittle-pool decay guard observe that the agent is now acing a template it previously struggled on.
The earlier implementation returned early on dedup without any state update. For deterministic templates the hash is stable across every re-roll, so a single struggled-then-won run would fossilise the cluster's recent_outcomes[-1] forever — no amount of subsequent clean wins could rotate the cluster out of the brittle pool. Incident 2026-04-21 11:56: 5 consecutive sql cycles all targeted the same template because the 09:06 DD struggled-then-won entry was pinned as the most-recent outcome.
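The split between guarded counters and the always-appended outcome can be sketched as follows. record_outcome is a hypothetical helper, and SHA1 over the UTF-8 challenge text is an assumption. Only two of the four guarded counters are shown; best_length and unlocked_tier_index would sit in the same non-duplicate branch. Persisting the state is left to the caller.

```python
import hashlib
import time

RECENT_OUTCOMES_MAX = 10  # schema: last 10 runs
RECENT_HASHES_MAX = 20    # schema: last 20 challenge SHA1s


def record_outcome(stats: dict, challenge: str, passed: bool,
                   attempts_used: int, delta: float) -> bool:
    """Update one cluster's stats in place; return True if the challenge is a duplicate."""
    digest = hashlib.sha1(challenge.encode("utf-8")).hexdigest()
    hashes = stats.setdefault("recent_hashes", [])
    is_dup = digest in hashes

    # Saturation signal: the outcome is appended even for duplicates.
    outcomes = stats.setdefault("recent_outcomes", [])
    outcomes.append({
        "timestamp": time.time(),
        "passed": passed,
        "attempts_used": attempts_used,
        "delta": 0.0 if is_dup else delta,
        "duplicate": is_dup,
    })
    del outcomes[:-RECENT_OUTCOMES_MAX]  # keep only the newest entries

    if not is_dup:
        # Mastery counters advance only on genuinely new challenge text.
        stats["runs"] = stats.get("runs", 0) + 1
        if passed and attempts_used == 1:
            stats["total_first_try_wins"] = stats.get("total_first_try_wins", 0) + 1
        hashes.append(digest)
        del hashes[:-RECENT_HASHES_MAX]
    return is_dup
```

A re-rolled deterministic template therefore keeps pushing fresh entries into recent_outcomes[-1]. This is exactly what the earlier early-return path failed to do.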
Saturation detection
A cluster is saturated when _cluster_is_saturated(stats) returns True: its last SATURATION_WINDOW runs are all first-try passes with delta ≤ SATURATION_DELTA_EPSILON. Semantics: the template bank has no new learning signal for this cluster — continuing to target it burns cycles on material the agent already aces.
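A sketch of the saturation predicate follows. It assumes recent_outcomes entries carry passed, attempts_used and delta fields. The SATURATION_WINDOW and SATURATION_DELTA_EPSILON values below are placeholders, since the real constants are not given here. Treating a cluster with fewer than SATURATION_WINDOW runs as unsaturated is also an assumption.

```python
SATURATION_WINDOW = 5            # placeholder value
SATURATION_DELTA_EPSILON = 0.01  # placeholder value


def cluster_is_saturated(stats: dict) -> bool:
    """True when the last SATURATION_WINDOW runs are all trivial first-try passes."""
    window = stats.get("recent_outcomes", [])[-SATURATION_WINDOW:]
    if len(window) < SATURATION_WINDOW:
        return False  # assumption: too little history means "not saturated"
    return all(
        bool(o.get("passed"))
        and o.get("attempts_used") == 1
        and o.get("delta", 0.0) <= SATURATION_DELTA_EPSILON
        for o in window
    )
```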
Saturated clusters are filtered out of _get_brittle_clusters_scored so the brittleness lottery never re-picks them. pick_seed handles the "everything saturated" case by returning mode="exploration" with a saturated_clusters list attached for the caller:
{"mode": "exploration", "cluster_key": None, "saturated_clusters": [...], "hint": "..."}
The caller (the dream loop) reads this and chooses a saturation-aware source: bumped journal probability, or a 20/80 coin-flip between pick_random_template(exclude_clusters=saturated) and LLM-gen (the majority path now falls through to LLM-gen for genuinely novel material instead of rotating to yet another deterministic template).
Brittleness scoring
brittleness = failures · 2 + hard_wins(≥3 attempts) · 2 + soft_wins(2 attempts) · 1
Saturated clusters are excluded from this score regardless of where they'd otherwise rank.
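The scoring rule translates directly into code. This sketch assumes each outcome dict has passed and attempts_used keys. rank_brittle, the score > 0 cutoff and the alphabetical tie-break are hypothetical, not the module's documented behaviour.

```python
def brittleness(window: list) -> int:
    """failures*2 + hard_wins(>=3 attempts)*2 + soft_wins(2 attempts)*1."""
    score = 0
    for o in window:
        attempts = o.get("attempts_used", 1)
        if not o.get("passed"):
            score += 2  # failure
        elif attempts >= 3:
            score += 2  # hard win
        elif attempts == 2:
            score += 1  # soft win
    return score


def rank_brittle(clusters: dict, saturated: set, limit: int) -> list:
    scored = []
    for key, stats in clusters.items():
        if key in saturated:
            continue  # excluded regardless of how brittle it would score
        score = brittleness(stats.get("recent_outcomes", []))
        if score > 0:  # assumption: zero-score clusters are not "brittle"
            scored.append((key, score))
    scored.sort(key=lambda kv: (-kv[1], kv[0]))  # highest score first
    return scored[:limit]
```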
Recent-win decay guard: if the most recent run in the brittleness window is a clean first-try pass with delta ≤ SATURATION_DELTA_EPSILON, the cluster is excluded from the brittle pool even if older struggled-then-won runs in the window would otherwise score it. This prevents a single attempts=2 outlier from anchoring a cluster as "brittle" for the next 10+ cycles of clean wins — the pathology that caused a single hard-won sql challenge to keep pulling sql back into the frontier for the rest of a loop.
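The decay guard inspects only the newest entry of the brittleness window. Below is a sketch with a placeholder epsilon and a hypothetical function name.

```python
SATURATION_DELTA_EPSILON = 0.01  # placeholder value


def excluded_by_recent_win(window: list) -> bool:
    """True if the newest run is a clean first-try pass, which drops the cluster from the pool."""
    if not window:
        return False
    last = window[-1]
    return (bool(last.get("passed"))
            and last.get("attempts_used") == 1
            and last.get("delta", 0.0) <= SATURATION_DELTA_EPSILON)
```

Older struggled-then-won entries still sit in the window. However, they stop mattering as soon as one clean win lands on top of them.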
Tier unlocks
unlocked_tier_index only moves upward — once unlocked, a tier is sticky even after a regression. Tiers: basic, intermediate, advanced, expert. Each unlock costs TIER_UNLOCK_THRESHOLD = 3 cumulative first-try wins.
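Unlocks are cumulative and sticky. The tier index can therefore be computed as a clamped integer division and then combined with max() against the stored value. update_tier is a hypothetical helper. The division-based mapping is an assumption consistent with "three cumulative first-try wins per unlock".

```python
TIERS = ("basic", "intermediate", "advanced", "expert")
TIER_UNLOCK_THRESHOLD = 3


def update_tier(stats: dict) -> str:
    """Raise unlocked_tier_index if enough first-try wins have accumulated; never lower it."""
    earned = min(stats.get("total_first_try_wins", 0) // TIER_UNLOCK_THRESHOLD,
                 len(TIERS) - 1)
    stats["unlocked_tier_index"] = max(stats.get("unlocked_tier_index", 0), earned)
    return TIERS[stats["unlocked_tier_index"]]
```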
Tier drives two things in the self-play loop:
- LLM prompt hint. get_difficulty_hint(cluster) returns a one-liner from DIFFICULTY_HINTS describing the tier's complexity floor. The dream loop injects this into the challenge-generation prompt when the LLM path is used.
- Deterministic template scaling. get_difficulty_tier(cluster) is piped into try_template / pick_random_template as a resolver, and each template scales its problem size (1× / 2× / 3× / 4×) and activates its hard-mode twist (NA rows, malformed lines, stopwords, NULL columns, extra log levels, expert concurrency variants) when the tier is advanced or expert.
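The following sketch shows how a template might consume the resolver. render_csv_template, the base row count and the NA-row wording are hypothetical. Only the tier names, the 1× to 4× size multipliers and the advanced/expert hard-mode switch come from the description above.

```python
from typing import Callable

# 1x / 2x / 3x / 4x problem size, in tier order
TIER_SCALE = {"basic": 1, "intermediate": 2, "advanced": 3, "expert": 4}


def render_csv_template(cluster_key: str,
                        tier_resolver: Callable[[str], str],
                        base_rows: int = 100) -> dict:
    """Hypothetical data.csv aggregation template that scales with the cluster's tier."""
    tier = tier_resolver(cluster_key)  # e.g. a bound tracker.get_difficulty_tier
    n_rows = base_rows * TIER_SCALE[tier]
    hard_mode = tier in ("advanced", "expert")
    text = f"Compute the per-category mean of 'amount' in data.csv ({n_rows} rows)."
    if hard_mode:
        text += " Some rows have NA in 'amount'; skip them."
    return {"tier": tier, "n_rows": n_rows, "hard_mode": hard_mode, "challenge": text}
```

Passing the resolver as a callable lets a template stay ignorant of FrontierTracker itself. A test can substitute a plain lambda.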
Before this wiring, the tier was purely cosmetic: the template bank always rendered at basic, and the agent solved every cluster on the first try in production. Tier is now the mechanism that turns a first-try win at basic into a harder challenge on the next cycle, so a cluster can accumulate real mastery signal rather than saturating on a fixed shape.
Frontier-aware cluster selection (PRM-weighted)
Brittleness scoring sees outcomes but not coverage: a cluster the agent has barely tried looks identical to a cluster it solves first-try, because both have no recent failures. pick_frontier_seed is the extension that splits these by combining two complementary signals computed in the pure-function layer at core/frontier_selection.py:
- PRM uncertainty: PRMScorer.uncertainty(state, action), computed as 1 − 2·|p − 0.5|. A representative PlanState per cluster ("solve a {cluster} challenge") is scored; clusters where the PRM has no opinion (untrained, or genuinely at the decision boundary) score near 1.0.
- Trajectory rarity: 1 / (1 + log1p(count)) over Trajectory.cluster groupings from TrajectoryCollector.iter_trajectories(). Smooth and bounded in (0, 1]; well-explored clusters decay slowly, so a high-uncertainty veteran cluster can still be picked.
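Both signals are cheap closed-form functions. The sketch below applies the uncertainty formula to a raw probability p; the real PRMScorer.uncertainty(state, action) first obtains p from the PRM. Trajectories per cluster are counted with collections.Counter, and any object with a .cluster attribute stands in for Trajectory here. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def prm_uncertainty(p: float) -> float:
    """1 at p = 0.5 (no opinion), 0 at p = 0 or p = 1 (confident)."""
    return 1.0 - 2.0 * abs(p - 0.5)


def rarity(count: int) -> float:
    """1.0 for an unseen cluster, decaying slowly (logarithmically) with count."""
    return 1.0 / (1.0 + math.log1p(count))


def rarity_by_cluster(clusters: Iterable[str], trajectories: Iterable) -> Dict[str, float]:
    counts = Counter(t.cluster for t in trajectories)  # Counter yields 0 for unseen keys
    return {c: rarity(counts[c]) for c in clusters}
```

Clusters with no trajectories at all get rarity 1.0, the maximum. This is what lets barely-tried clusters stand out from well-solved ones.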
The two are multiplied: a cluster needs both "we don't know much about it" AND "we don't have many examples" to win. Saturated clusters (per list_saturated_clusters()) are excluded with weight 0. pick_weighted samples in proportion.
Three transparent fallbacks restore the legacy pick_seed behaviour without behavioural drift:
- Empty signals: the caller passed no uncertainty / no rarity (PRM untrained AND trajectory store empty). Preserves existing behaviour at cold boot.
- Uniform-sample sanity floor: with probability uniform_sample_prob (default 0.2, exposed as --frontier-uniform-sample-prob), the picker bypasses frontier weighting and calls pick_seed directly. The PRM is itself learned from trajectories the self-play loop produces; without this floor a cold bias could self-reinforce onto a single cluster.
- All weights zero: every non-saturated cluster has one signal at 0 (e.g., NaN crept in, or every candidate was excluded). Falls back rather than picking nothing.
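A possible sketch of the weighting plus the three-rung fallback ladder follows. frontier_pick, its tier_of and rng parameters, and the hint text after the documented prefix are hypothetical. Treating a cluster missing from either signal map as weight 0 is an assumption that follows from the "needs both" rule. On Python 3.9 and later, random.choices raises ValueError when every weight is zero. That is one practical reason the all-zero rung must run before sampling.

```python
import math
import random
from typing import Callable, Dict, Iterable


def frontier_pick(uncertainty_by_cluster: Dict[str, float],
                  rarity_by_cluster: Dict[str, float],
                  saturated: Iterable[str],
                  pick_seed: Callable[..., dict],
                  tier_of: Callable[[str], str],
                  rng: random.Random,
                  uniform_sample_prob: float = 0.2,
                  random_explore_prob: float = 0.35) -> dict:
    saturated = set(saturated)
    # Rung 1: empty signals (cold boot) -> legacy picker, unchanged.
    if not uncertainty_by_cluster and not rarity_by_cluster:
        return pick_seed(random_explore_prob=random_explore_prob)
    # Rung 2: uniform-sample sanity floor against self-reinforcing PRM bias.
    if rng.random() < uniform_sample_prob:
        seed = dict(pick_seed(random_explore_prob=random_explore_prob))
        seed["frontier_fallback"] = "uniform_sample"
        return seed
    # Combined weight = uncertainty x rarity; saturated clusters get 0, NaN becomes 0.
    weights = {}
    for key in sorted(set(uncertainty_by_cluster) | set(rarity_by_cluster)):
        w = 0.0 if key in saturated else (
            uncertainty_by_cluster.get(key, 0.0) * rarity_by_cluster.get(key, 0.0))
        weights[key] = w if math.isfinite(w) and w > 0 else 0.0
    positive = [k for k, w in weights.items() if w > 0]
    # Rung 3: nothing pickable -> fall back rather than pick nothing.
    if not positive:
        seed = dict(pick_seed(random_explore_prob=random_explore_prob))
        seed["frontier_fallback"] = "no_positive_weight"
        return seed
    key = rng.choices(positive, weights=[weights[k] for k in positive], k=1)[0]
    return {
        "mode": "frontier_weighted",
        "cluster_key": key,
        "difficulty_tier": tier_of(key),
        "saturated_clusters": sorted(saturated),
        "weight": weights[key],
        "uncertainty": uncertainty_by_cluster.get(key, 0.0),
        "rarity": rarity_by_cluster.get(key, 0.0),
        "hint": f"FRONTIER TARGET (PRM-weighted): {key}",
    }
```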
Returned dict mirrors pick_seed's shape so call sites need no schema branching, plus extra fields for inspection:
{
"mode": "frontier_weighted",
"cluster_key": "...",
"difficulty_tier": "...",
"saturated_clusters": [...],
"weight": float, # combined uncertainty × rarity
"uncertainty": float,
"rarity": float,
"hint": "FRONTIER TARGET (PRM-weighted): ..."
}
The hint string always begins with FRONTIER TARGET (PRM-weighted) so logs can distinguish it from the brittle-pool path. Fallback seeds carry an extra frontier_fallback key ("uniform_sample" or "no_positive_weight") for log attribution.
Gate conditions in Dreamer.synthetic_self_play use isinstance() rather than truthiness because MagicMock-backed test contexts return Mocks for any attribute access — type-checks fail closed for both ctx.prm_scorer and ctx.trajectory_collector.
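The MagicMock pitfall is easy to reproduce. In this sketch, PRMScorer and TrajectoryCollector are hypothetical empty stand-ins for the real classes. frontier_inputs_available is a hypothetical name for the gate.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


class PRMScorer:            # hypothetical stand-in for the real class
    pass


class TrajectoryCollector:  # hypothetical stand-in for the real class
    pass


def frontier_inputs_available(ctx) -> bool:
    """Type-checked gate: a Mock attribute is truthy but is not an instance of either class."""
    return (isinstance(getattr(ctx, "prm_scorer", None), PRMScorer)
            and isinstance(getattr(ctx, "trajectory_collector", None), TrajectoryCollector))


mock_ctx = MagicMock()  # every attribute access returns another (truthy) MagicMock
real_ctx = SimpleNamespace(prm_scorer=PRMScorer(), trajectory_collector=TrajectoryCollector())
```

A truthiness gate such as `if ctx.prm_scorer:` would pass for mock_ctx and then fail deep inside the frontier path. The isinstance gate rejects the mock up front.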
Methods
| Method | Purpose |
|---|---|
| get_cluster_stats(key) | Stats blob. |
| get_brittle_clusters(limit) | Top brittle clusters by score (saturated clusters filtered out). |
| get_difficulty_tier(key) | basic / intermediate / advanced / expert. |
| get_difficulty_hint(key) | Tier-specific prompt hint. |
| _cluster_is_saturated(stats) → bool (classmethod) | Check if last N runs are all trivial first-try wins. |
| list_saturated_clusters() → list | All currently-saturated cluster keys. |
| pick_seed(random_explore_prob=0.2) | Choose next cluster + hint for self-play (brittle-pool path). Returns exploration mode with saturated_clusters when all brittle candidates are saturated. Callers typically pass random_explore_prob=0.35 (dream loop default) so even non-saturated picks get breathing room. |
| pick_frontier_seed(uncertainty_by_cluster, rarity_by_cluster, uniform_sample_prob=0.2, random_explore_prob=0.35) | Frontier-weighted alternative. Combines PRM uncertainty + trajectory rarity, excludes saturated clusters, falls back to pick_seed on empty signals / sanity-sample roll / all-zero weights. Returns the same dict shape as pick_seed with mode="frontier_weighted". |
| record_run(cluster_key, challenge, attempts_used, passed, description_length, mistake) → dict | Log a run; returns {compression_delta, mastered, is_new_cluster, …}. |
| adaptive_cooldown(base, floor, ceiling, cluster_key) | Cooldown seconds based on recent progress. Used by both the biological watchdog (minutes–hours) and self_play_loop (5–180 s clamped). |
Tests
Frontier-tracker invariants are pinned by tests/test_frontier_tracker.py; the new frontier-weighted picker is covered by tests/test_frontier_pick_frontier_seed.py (fallback ladder + dict-shape contract) and the dream integration by tests/test_dream_frontier_weighted.py (real PRM + real TrajectoryCollector, mocked LLM/sandbox).