memory / frontier.py — FrontierTracker

Per-cluster self-play telemetry. Detects mastery, ranks brittleness, scaffolds difficulty tiers, and rotates away from saturated clusters.

Storage

JSON at memory_dir/self_play_frontier.json.
Cross-process advisory lock at memory_dir/self_play_frontier.json.lock via fcntl; on non-POSIX platforms, where fcntl is unavailable, it falls back to in-process threading locks only.
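
A minimal sketch of this locking scheme. It assumes fcntl.flock is the call in use (the text above only says "via fcntl"), and frontier_lock is a hypothetical helper name, not the module's actual API:

```python
import contextlib
import os
import threading

try:
    import fcntl  # POSIX only
except ImportError:  # e.g. Windows
    fcntl = None

# In-process guard; always taken, even when the file lock is available.
_THREAD_LOCK = threading.RLock()


@contextlib.contextmanager
def frontier_lock(memory_dir):
    """Hypothetical helper: thread lock plus an advisory file lock where possible."""
    lock_path = os.path.join(memory_dir, "self_play_frontier.json.lock")
    with _THREAD_LOCK:
        if fcntl is None:
            yield  # threading-only fallback
            return
        with open(lock_path, "a") as handle:
            fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until other processes release
            try:
                yield
            finally:
                fcntl.flock(handle, fcntl.LOCK_UN)
```

A read-modify-write of the JSON file would then sit inside `with frontier_lock(memory_dir):`. The lock is advisory, so it only protects against writers that take the same lock.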

Schema

{
  "runs": [{ "timestamp", "cluster_key", "challenge", "passed", "attempts_used",
             "length", "delta", "mistake" }],
  "clusters": {
    "<cluster_key>": {
      "runs": int,
      "best_length": int,
      "last_length": int,
      "last_compression": float,
      "mastered": bool,
      "recent_outcomes": [last 10 runs],
      "recent_hashes":   [last 20 challenge SHA1s],
      "total_first_try_wins": int,
      "unlocked_tier_index": int  // monotonic
    }
  }
}

Constants

Constant	Value
`MAX_RUNS`	200 (tail evict)
`DEDUP_WINDOW`	20 challenge hashes; used to detect duplicate challenge texts (common on deterministic templates)
`MASTERED_STREAK`	5 consecutive first-try wins + ≥1 run with delta > 0.05
`BRITTLE_WINDOW`	3; rolling window for brittleness scoring
`SATURATION_WINDOW`	2; the last N runs must all be first-try wins with near-zero delta (lowered from 3 after a 13-cycle loop produced 0 net lessons; at 2 the cluster rotates out one clean cycle after a struggle)
`SATURATION_DELTA_EPSILON`	0.001
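
As a rough illustration of how MAX_RUNS and MASTERED_STREAK might be applied, the sketch below makes two assumptions that the text does not confirm: "tail evict" means keeping only the newest MAX_RUNS entries, and the delta > 0.05 run may sit anywhere in the supplied history. The names is_mastered, append_run, and MASTERY_MIN_DELTA are hypothetical.

```python
MAX_RUNS = 200
MASTERED_STREAK = 5
MASTERY_MIN_DELTA = 0.05  # hypothetical name for the "delta > 0.05" threshold


def is_mastered(outcomes):
    """outcomes: oldest-first list of dicts with passed, attempts_used, delta."""
    if len(outcomes) < MASTERED_STREAK:
        return False
    streak = outcomes[-MASTERED_STREAK:]
    all_first_try = all(o["passed"] and o["attempts_used"] == 1 for o in streak)
    # Assumption: one run with a real compression gain anywhere in the history counts.
    any_real_gain = any(o["delta"] > MASTERY_MIN_DELTA for o in outcomes)
    return all_first_try and any_real_gain


def append_run(runs, entry):
    """Append a run and keep only the newest MAX_RUNS entries (assumed eviction rule)."""
    runs.append(entry)
    del runs[:-MAX_RUNS]
    return runs
```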

Dedup behaviour on duplicate challenges

Deterministic templates (shop.db GROUP BY, data.csv aggregation, logs/ ERROR+WARN counting, etc.) produce byte-identical challenge text on every re-roll, so every re-roll hits recent_hashes as a duplicate. The dedup path splits two concerns:

Protect mastery counters: runs, total_first_try_wins, best_length, and unlocked_tier_index do NOT advance on a duplicate. Without this guard a deterministic template would ratchet any cluster to "mastered" in a handful of re-rolls.
Preserve saturation signal: the outcome IS still appended to recent_outcomes (tagged duplicate=True, delta=0.0) and state IS persisted. This is what lets _cluster_is_saturated and the brittle-pool decay guard observe that the agent is now acing a template it previously struggled on.
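
The split between the two concerns can be sketched as follows. This is an illustration only: record_outcome is a hypothetical stand-in for the relevant part of record_run, it tracks only a subset of the stats fields, and whether a duplicate's hash is re-appended to recent_hashes is an assumption.

```python
import hashlib

DEDUP_WINDOW = 20
RECENT_OUTCOMES = 10  # per the schema: recent_outcomes holds the last 10 runs


def record_outcome(stats, challenge, passed, attempts_used, delta):
    """Update one cluster's stats in place; returns True if the challenge was a duplicate."""
    digest = hashlib.sha1(challenge.encode("utf-8")).hexdigest()
    duplicate = digest in stats["recent_hashes"]

    # Saturation signal: the outcome is always appended, duplicates with delta=0.0.
    outcome = {
        "passed": passed,
        "attempts_used": attempts_used,
        "delta": 0.0 if duplicate else delta,
        "duplicate": duplicate,
    }
    stats["recent_outcomes"] = (stats["recent_outcomes"] + [outcome])[-RECENT_OUTCOMES:]

    # Mastery counters: only advance on a genuinely new challenge.
    if not duplicate:
        stats["runs"] += 1
        if passed and attempts_used == 1:
            stats["total_first_try_wins"] += 1
        stats["recent_hashes"] = (stats["recent_hashes"] + [digest])[-DEDUP_WINDOW:]
    return duplicate
```

Re-rolling the same deterministic template twice leaves runs at 1 but puts two entries in recent_outcomes, which is the property the saturation check relies on.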

The earlier implementation returned early on dedup without any state update. For deterministic templates the hash is stable across every re-roll, so a single struggled-then-won run would fossilise the cluster's recent_outcomes[-1] forever — no amount of subsequent clean wins could rotate the cluster out of the brittle pool. Incident 2026-04-21 11:56: 5 consecutive sql cycles all targeted the same template because the 09:06 DD struggled-then-won entry was pinned as the most-recent outcome.

Saturation detection

A cluster is saturated when _cluster_is_saturated(stats) returns True: its last SATURATION_WINDOW runs are all first-try passes with delta ≤ SATURATION_DELTA_EPSILON. Semantics: the template bank has no new learning signal for this cluster — continuing to target it burns cycles on material the agent already aces.
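
A minimal sketch of the check, written as a free function over the recent_outcomes list. The real method is a classmethod on FrontierTracker; the outcome field names used here (passed, attempts_used, delta) are assumed from the schema.

```python
SATURATION_WINDOW = 2
SATURATION_DELTA_EPSILON = 0.001


def cluster_is_saturated(stats):
    """True when the last SATURATION_WINDOW outcomes are all trivial first-try wins."""
    recent = stats.get("recent_outcomes", [])
    if len(recent) < SATURATION_WINDOW:
        return False  # not enough history to call it saturated
    return all(
        o["passed"]
        and o["attempts_used"] == 1
        and o["delta"] <= SATURATION_DELTA_EPSILON
        for o in recent[-SATURATION_WINDOW:]
    )
```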

Saturated clusters are filtered out of _get_brittle_clusters_scored so the brittleness lottery never re-picks them. pick_seed handles the "everything saturated" case by returning mode="exploration" with a saturated_clusters list attached for the caller:

{"mode": "exploration", "cluster_key": None, "saturated_clusters": [...], "hint": "..."}

The caller (the dream loop) reads this and chooses a saturation-aware source: either a raised journal probability, or a weighted 20/80 draw between pick_random_template(exclude_clusters=saturated) and LLM-gen. The majority path now falls through to LLM-gen for genuinely novel material instead of rotating to yet another deterministic template.
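
The template-versus-LLM draw on the caller side might look like the sketch below. choose_source and TEMPLATE_PROB are hypothetical names, the journal path is left out, and the 20% share is assumed to belong to the template branch because the text calls LLM-gen the majority path.

```python
import random

TEMPLATE_PROB = 0.2  # assumed: 20% template, 80% LLM-gen


def choose_source(seed, rng=random):
    """Return (source, excluded_clusters) for a seed dict shaped like pick_seed's output."""
    saturated = seed.get("saturated_clusters") or []
    if seed.get("mode") == "exploration" and saturated:
        if rng.random() < TEMPLATE_PROB:
            # Minority path: a deterministic template from a non-saturated cluster.
            return ("template", set(saturated))
        # Majority path: fall through to LLM generation for novel material.
        return ("llm", set())
    return ("default", set())
```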

Brittleness scoring

brittleness = failures · 2  +  hard_wins(≥3 attempts) · 2  +  soft_wins(2 attempts) · 1

Saturated clusters are excluded from this score regardless of where they'd otherwise rank.

Recent-win decay guard: if the most recent run in the brittleness window is a clean first-try pass with delta ≤ SATURATION_DELTA_EPSILON, the cluster is excluded from the brittle pool even if older struggled-then-won runs in the window would otherwise score it. This prevents a single attempts=2 outlier from anchoring a cluster as "brittle" for the next 10+ cycles of clean wins — the pathology that caused a single hard-won sql challenge to keep pulling sql back into the frontier for the rest of a loop.
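
The formula and the decay guard together can be sketched as below. brittleness_score is a hypothetical free function; the outcome field names are assumed from the schema, and returning 0 stands in for "excluded from the brittle pool".

```python
BRITTLE_WINDOW = 3
SATURATION_DELTA_EPSILON = 0.001


def brittleness_score(recent_outcomes):
    """Score the last BRITTLE_WINDOW outcomes; 0 means not brittle."""
    window = recent_outcomes[-BRITTLE_WINDOW:]
    if not window:
        return 0

    # Recent-win decay guard: a clean, trivial first-try pass on the most
    # recent run overrides older struggles in the window.
    last = window[-1]
    if last["passed"] and last["attempts_used"] == 1 and last["delta"] <= SATURATION_DELTA_EPSILON:
        return 0

    score = 0
    for outcome in window:
        if not outcome["passed"]:
            score += 2  # failure
        elif outcome["attempts_used"] >= 3:
            score += 2  # hard win
        elif outcome["attempts_used"] == 2:
            score += 1  # soft win
    return score
```

For a window of one failure, one 3-attempt win, and one 2-attempt win, the score is 2 + 2 + 1 = 5. Appending a clean first-try pass drops it to 0 through the guard.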

Tier unlocks

unlocked_tier_index only moves upward — once unlocked, a tier is sticky even after a regression. Tiers: basic, intermediate, advanced, expert. Each unlock costs TIER_UNLOCK_THRESHOLD = 3 cumulative first-try wins.
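
One way to read the unlock rule, assuming the index is derived directly from total_first_try_wins and clamped to the last tier (next_tier_index is a hypothetical name, and the real bookkeeping may differ):

```python
TIERS = ("basic", "intermediate", "advanced", "expert")
TIER_UNLOCK_THRESHOLD = 3


def next_tier_index(current_index, total_first_try_wins):
    """Return the new unlocked_tier_index; it never decreases."""
    earned = min(total_first_try_wins // TIER_UNLOCK_THRESHOLD, len(TIERS) - 1)
    return max(current_index, earned)  # sticky: a regression cannot lower the tier
```

Under this reading, 3 cumulative first-try wins unlock intermediate, 6 unlock advanced, and 9 unlock expert.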

Tier drives two things in the self-play loop:

LLM prompt hint. get_difficulty_hint(cluster) returns a one-liner from DIFFICULTY_HINTS describing the tier's complexity floor. The dream loop injects this into the challenge-generation prompt when the LLM path is used.
Deterministic template scaling. get_difficulty_tier(cluster) is piped into try_template / pick_random_template as a resolver, and each template scales its problem size (1× / 2× / 3× / 4×) and activates its hard-mode twist (NA rows, malformed lines, stopwords, NULL columns, extra log levels, expert concurrency variants) when tier is advanced or expert.
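
A sketch of how a template might consume the tier resolver. render_params, TIER_SCALE, and HARD_TIERS are hypothetical names, and mapping 1× / 2× / 3× / 4× onto the four tiers in order is an assumption.

```python
TIER_SCALE = {"basic": 1, "intermediate": 2, "advanced": 3, "expert": 4}
HARD_TIERS = {"advanced", "expert"}


def render_params(base_rows, tier_resolver, cluster):
    """Resolve the cluster's tier and derive the template's size and hard-mode flag."""
    tier = tier_resolver(cluster)  # e.g. FrontierTracker.get_difficulty_tier
    return {
        "tier": tier,
        "rows": base_rows * TIER_SCALE[tier],
        "hard_mode": tier in HARD_TIERS,  # NA rows, malformed lines, etc.
    }
```

Passing the resolver as a callable keeps the template bank independent of the tracker, which is presumably why the text describes it as "piped in as a resolver".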

Before this wiring, the tier was purely cosmetic: the template bank always rendered at basic and the agent 1-shot every cluster in production. Tier is now the mechanism that turns a first-try win at basic into a harder challenge at the next cycle, so a cluster can accumulate real mastery signal rather than saturating on a fixed shape.

Frontier-aware cluster selection (PRM-weighted)

Brittleness scoring sees outcomes but not coverage: a cluster the agent has barely tried looks identical to a cluster it solves first-try, because both have no recent failures. pick_frontier_seed is the extension that splits these by combining two complementary signals computed in the pure-function layer at core/frontier_selection.py:

PRM uncertainty: PRMScorer.uncertainty(state, action), computed as 1 − 2·|p − 0.5|, where p is the PRM's predicted probability. A representative PlanState per cluster ("solve a {cluster} challenge") is scored; clusters where the PRM has no opinion (untrained, or genuinely at the decision boundary) score near 1.0.
Trajectory rarity: 1 / (1 + log1p(count)) over Trajectory.cluster groupings from TrajectoryCollector.iter_trajectories(). Smooth and bounded in (0, 1]; well-explored clusters decay slowly, so a high-uncertainty veteran cluster can still be picked.

The two are multiplied: a cluster needs both "we don't know much about it" AND "we don't have many examples" to win. Saturated clusters (per list_saturated_clusters()) are excluded with weight 0. pick_weighted samples in proportion.
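
The two signals and their product can be sketched as pure functions. The function names are illustrative rather than the actual contents of core/frontier_selection.py, and restricting candidates to clusters present in both maps is an assumption.

```python
import math
import random


def uncertainty_from_probability(p):
    """1 at p = 0.5 (no opinion), 0 at p = 0 or p = 1 (fully confident)."""
    return 1.0 - 2.0 * abs(p - 0.5)


def rarity_from_count(count):
    """1 for an unseen cluster, decaying slowly as trajectories accumulate."""
    return 1.0 / (1.0 + math.log1p(count))


def frontier_weights(uncertainty_by_cluster, rarity_by_cluster, saturated):
    """Combined weight per cluster; saturated clusters get weight 0."""
    weights = {}
    for key in set(uncertainty_by_cluster) & set(rarity_by_cluster):
        if key in saturated:
            weights[key] = 0.0
        else:
            weights[key] = uncertainty_by_cluster[key] * rarity_by_cluster[key]
    return weights


def pick_weighted(weights, rng=random):
    """Sample a cluster in proportion to its weight, or None if nothing is positive."""
    keys = sorted(k for k, w in weights.items() if w > 0)
    if not keys:
        return None
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]
```

Because the signals are multiplied rather than added, a cluster with either signal at 0 can never be sampled, which matches the "needs both" requirement.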

Three transparent fallbacks restore the legacy pick_seed behaviour without behavioural drift:

Empty signals: the caller passed no uncertainty / no rarity (PRM untrained AND trajectory store empty). Preserves existing behaviour at cold-boot.
Uniform-sample sanity floor: with probability uniform_sample_prob (default 0.2, exposed as --frontier-uniform-sample-prob), the picker bypasses frontier weighting and calls pick_seed directly. The PRM is itself learned from trajectories the self-play loop produces; without this floor a cold bias could self-reinforce onto a single cluster.
All weights zero: every non-saturated cluster ends up with a combined weight of 0 because at least one of its signals is 0 (e.g., NaN crept in), or every candidate was excluded. Falls back rather than picking nothing.
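
The fallback ladder can be sketched as a single function. Several details are assumptions: legacy_pick stands in for pick_seed, either map being empty counts as "empty signals", a non-finite product is treated as zero weight, and the frontier_fallback values are the two named later in this section.

```python
import math
import random


def pick_frontier_seed_sketch(uncertainty_by_cluster, rarity_by_cluster, saturated,
                              legacy_pick, uniform_sample_prob=0.2, rng=random):
    # Fallback 1: empty signals, e.g. at cold boot.
    if not uncertainty_by_cluster or not rarity_by_cluster:
        return legacy_pick()

    # Fallback 2: uniform-sample sanity floor.
    if rng.random() < uniform_sample_prob:
        return {**legacy_pick(), "frontier_fallback": "uniform_sample"}

    weights = {}
    for key in set(uncertainty_by_cluster) & set(rarity_by_cluster):
        if key in saturated:
            continue
        weight = uncertainty_by_cluster[key] * rarity_by_cluster[key]
        if math.isfinite(weight) and weight > 0:  # drops NaN and zero weights
            weights[key] = weight

    # Fallback 3: nothing left with a positive weight.
    if not weights:
        return {**legacy_pick(), "frontier_fallback": "no_positive_weight"}

    keys = sorted(weights)
    choice = rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]
    return {
        "mode": "frontier_weighted",
        "cluster_key": choice,
        "weight": weights[choice],
        "uncertainty": uncertainty_by_cluster[choice],
        "rarity": rarity_by_cluster[choice],
    }
```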

Returned dict mirrors pick_seed's shape so call sites need no schema branching, plus extra fields for inspection:

{
  "mode": "frontier_weighted",
  "cluster_key": "...",
  "difficulty_tier": "...",
  "saturated_clusters": [...],
  "weight": float,          # combined uncertainty × rarity
  "uncertainty": float,
  "rarity": float,
  "hint": "FRONTIER TARGET (PRM-weighted): ..."
}

The hint string always begins with FRONTIER TARGET (PRM-weighted) so logs can distinguish it from the brittle-pool path. Fallback seeds carry an extra frontier_fallback key ("uniform_sample" or "no_positive_weight") for log attribution.

Gate conditions in Dreamer.synthetic_self_play use isinstance() rather than truthiness because MagicMock-backed test contexts return a truthy Mock for any attribute access, so a truthiness check would always pass. With type checks, the gates for both ctx.prm_scorer and ctx.trajectory_collector fail closed.
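
The difference is easy to reproduce. In the sketch below, the two classes are empty stand-ins for the real PRMScorer and TrajectoryCollector, and frontier_gate_open is a hypothetical name for the gate condition:

```python
import types
from unittest.mock import MagicMock


class PRMScorer:  # stand-in for the real class
    pass


class TrajectoryCollector:  # stand-in for the real class
    pass


def frontier_gate_open(ctx):
    """Type-checked gate: only real instances enable the frontier-weighted path."""
    return (
        isinstance(getattr(ctx, "prm_scorer", None), PRMScorer)
        and isinstance(getattr(ctx, "trajectory_collector", None), TrajectoryCollector)
    )


mock_ctx = MagicMock()
real_ctx = types.SimpleNamespace(
    prm_scorer=PRMScorer(), trajectory_collector=TrajectoryCollector()
)

truthy_gate_on_mock = bool(mock_ctx.prm_scorer)   # auto-created attribute is truthy
typed_gate_on_mock = frontier_gate_open(mock_ctx)  # fails closed
typed_gate_on_real = frontier_gate_open(real_ctx)
```

Here the truthiness check passes on the mock context, while the isinstance gate rejects it and still accepts the real one.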

Methods

Method	Purpose
`get_cluster_stats(key)`	Stats blob.
`get_brittle_clusters(limit)`	Top brittle clusters by score (saturated clusters filtered out).
`get_difficulty_tier(key)`	`basic` / `intermediate` / `advanced` / `expert`.
`get_difficulty_hint(key)`	Tier-specific prompt hint.
`_cluster_is_saturated(stats) → bool` (classmethod)	Check if last N runs are all trivial first-try wins.
`list_saturated_clusters() → list`	All currently-saturated cluster keys.
`pick_seed(random_explore_prob=0.2)`	Choose next cluster + hint for self-play (brittle-pool path). Returns exploration mode with `saturated_clusters` when all brittle candidates are saturated. Callers typically pass `random_explore_prob=0.35` (dream loop default) so even non-saturated picks get breathing room.
`pick_frontier_seed(uncertainty_by_cluster, rarity_by_cluster, uniform_sample_prob=0.2, random_explore_prob=0.35)`	Frontier-weighted alternative. Combines PRM uncertainty + trajectory rarity, excludes saturated clusters, falls back to `pick_seed` on empty signals / sanity-sample roll / all-zero weights. Returns the same dict shape as `pick_seed` with `mode="frontier_weighted"`.
`record_run(cluster_key, challenge, attempts_used, passed, description_length, mistake) → dict`	Log a run; returns `{compression_delta, mastered, is_new_cluster, …}`.
`adaptive_cooldown(base, floor, ceiling, cluster_key)`	Cooldown seconds based on recent progress. Used by both the biological watchdog (minutes–hours) and `self_play_loop` (5–180 s clamped).

Tests

Frontier-tracker invariants are pinned by tests/test_frontier_tracker.py; the new frontier-weighted picker is covered by tests/test_frontier_pick_frontier_seed.py (fallback ladder + dict-shape contract) and the dream integration by tests/test_dream_frontier_weighted.py (real PRM + real TrajectoryCollector, mocked LLM/sandbox).