memory / frontier.py — FrontierTracker

Per-cluster self-play telemetry. Detects mastery, ranks brittleness, scaffolds difficulty tiers, and rotates away from saturated clusters.

Storage

Schema

{
  "runs": [{ "timestamp", "cluster_key", "challenge", "passed", "attempts_used",
             "length", "delta", "mistake" }],
  "clusters": {
    "<cluster_key>": {
      "runs": int,
      "best_length": int,
      "last_length": int,
      "last_compression": float,
      "mastered": bool,
      "recent_outcomes": [last 10 runs],
      "recent_hashes":   [last 20 challenge SHA1s],
      "total_first_try_wins": int,
      "unlocked_tier_index": int  // monotonic
    }
  }
}

Constants

MAX_RUNS | 200 (tail evict)
DEDUP_WINDOW | 20 challenge hashes; used to detect duplicate challenge texts (common on deterministic templates)
MASTERED_STREAK | 5 consecutive first-try wins + ≥1 run with delta > 0.05
BRITTLE_WINDOW | 3 (rolling window for brittleness scoring)
SATURATION_WINDOW | 2 (the last N runs must all be first-try wins with near-zero delta). Lowered from 3 after a 13-cycle loop produced 0 net lessons; at 2 the cluster rotates out one clean cycle after a struggle.
SATURATION_DELTA_EPSILON | 0.001

Dedup behaviour on duplicate challenges

Deterministic templates (shop.db GROUP BY, data.csv aggregation, logs/ ERROR+WARN counting, etc.) produce byte-identical challenge text on every re-roll, so every re-roll hits recent_hashes as a duplicate. The dedup path splits two concerns:

The earlier implementation returned early on dedup without any state update. For deterministic templates the hash is stable across every re-roll, so a single struggled-then-won run would fossilise the cluster's recent_outcomes[-1] forever — no amount of subsequent clean wins could rotate the cluster out of the brittle pool. Incident 2026-04-21 11:56: 5 consecutive sql cycles all targeted the same template because the 09:06 DD struggled-then-won entry was pinned as the most-recent outcome.

Saturation detection

A cluster is saturated when _cluster_is_saturated(stats) returns True: its last SATURATION_WINDOW runs are all first-try passes with delta ≤ SATURATION_DELTA_EPSILON. Semantics: the template bank has no new learning signal for this cluster — continuing to target it burns cycles on material the agent already aces.
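The check described above can be sketched as follows. This is a minimal illustration, not the module's actual code: it assumes each entry in recent_outcomes is a dict with passed, attempts_used, and delta keys (the schema only says "last 10 runs"), and it assumes a cluster with fewer than SATURATION_WINDOW runs is never saturated.

```python
SATURATION_WINDOW = 2
SATURATION_DELTA_EPSILON = 0.001


def cluster_is_saturated(stats: dict) -> bool:
    """Return True when the last SATURATION_WINDOW runs are all trivial first-try wins."""
    # Assumed entry shape: {"passed": bool, "attempts_used": int, "delta": float}.
    recent = stats.get("recent_outcomes", [])[-SATURATION_WINDOW:]
    if len(recent) < SATURATION_WINDOW:
        return False  # assumption: too little history to call the cluster saturated
    return all(
        run["passed"]
        and run["attempts_used"] == 1
        and run["delta"] <= SATURATION_DELTA_EPSILON
        for run in recent
    )
```

With SATURATION_WINDOW at 2, a struggled-then-won run followed by two clean wins is enough for the sketch to report saturation.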

Saturated clusters are filtered out of _get_brittle_clusters_scored so the brittleness lottery never re-picks them. pick_seed handles the "everything saturated" case by returning mode="exploration" with a saturated_clusters list attached for the caller:

{"mode": "exploration", "cluster_key": None, "saturated_clusters": [...], "hint": "..."}

The caller (the dream loop) reads this and chooses a saturation-aware source: either a bumped journal probability, or a 20/80 coin-flip between pick_random_template(exclude_clusters=saturated) and LLM-gen. The 80% majority path falls through to LLM-gen, so the loop receives genuinely novel material instead of rotating to yet another deterministic template.
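A rough sketch of the coin-flip on the caller's side, assuming a seed dict shaped like the one above. The function name choose_saturation_source, the TEMPLATE_PROB constant, and the "default" return value are hypothetical; the journal-probability branch is not modelled here.

```python
import random

TEMPLATE_PROB = 0.2  # the "20" side of the 20/80 split; LLM-gen takes the rest


def choose_saturation_source(seed: dict, rng: random.Random) -> str:
    """Return "template" or "llm" for a saturated exploration seed, else "default"."""
    if seed.get("mode") == "exploration" and seed.get("saturated_clusters"):
        # Minority path: another deterministic template, excluding saturated clusters.
        # Majority path: LLM-generated challenge.
        return "template" if rng.random() < TEMPLATE_PROB else "llm"
    return "default"  # hypothetical label for the normal, non-saturated path
```

Passing the random generator in explicitly keeps the split testable with a seeded or stubbed rng.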

Brittleness scoring

brittleness = failures · 2  +  hard_wins(≥3 attempts) · 2  +  soft_wins(2 attempts) · 1

Saturated clusters are excluded from this score regardless of where they'd otherwise rank.

Recent-win decay guard: if the most recent run in the brittleness window is a clean first-try pass with delta ≤ SATURATION_DELTA_EPSILON, the cluster is excluded from the brittle pool even if older struggled-then-won runs in the window would otherwise score it. This prevents a single attempts=2 outlier from anchoring a cluster as "brittle" for the next 10+ cycles of clean wins — the pathology that caused a single hard-won sql challenge to keep pulling sql back into the frontier for the rest of a loop.
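The formula and the decay guard can be read together as the sketch below. It assumes the same per-run dict shape as the schema's runs entries (passed, attempts_used, delta) and treats a score of 0 as "not in the brittle pool"; both are assumptions made for illustration.

```python
BRITTLE_WINDOW = 3
SATURATION_DELTA_EPSILON = 0.001


def brittleness(recent_outcomes: list) -> int:
    """Score the last BRITTLE_WINDOW runs; 0 means the cluster is not brittle."""
    window = recent_outcomes[-BRITTLE_WINDOW:]
    if not window:
        return 0
    last = window[-1]
    # Recent-win decay guard: a clean, near-zero-delta first-try pass on the
    # most recent run overrides any older struggles still inside the window.
    if (
        last["passed"]
        and last["attempts_used"] == 1
        and last["delta"] <= SATURATION_DELTA_EPSILON
    ):
        return 0
    score = 0
    for run in window:
        if not run["passed"]:
            score += 2  # failure
        elif run["attempts_used"] >= 3:
            score += 2  # hard win
        elif run["attempts_used"] == 2:
            score += 1  # soft win
    return score
```

Under this reading, a failure followed by a soft win scores 3, and the same history followed by one clean first-try pass scores 0.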

Tier unlocks

unlocked_tier_index only moves upward — once unlocked, a tier is sticky even after a regression. Tiers: basic, intermediate, advanced, expert. Each unlock costs TIER_UNLOCK_THRESHOLD = 3 cumulative first-try wins.
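One way to picture the monotonic unlock, as a sketch only: it assumes the earned tier is total_first_try_wins divided by TIER_UNLOCK_THRESHOLD (rounded down and capped at expert), which is one plausible reading of "each unlock costs 3 cumulative first-try wins". The helper name update_unlocked_tier is hypothetical.

```python
TIERS = ["basic", "intermediate", "advanced", "expert"]
TIER_UNLOCK_THRESHOLD = 3


def update_unlocked_tier(stats: dict) -> str:
    """Raise stats["unlocked_tier_index"] if enough wins accrued; never lower it."""
    wins = stats.get("total_first_try_wins", 0)
    earned = min(wins // TIER_UNLOCK_THRESHOLD, len(TIERS) - 1)
    # max() keeps the index sticky: a regression cannot re-lock a tier.
    stats["unlocked_tier_index"] = max(stats.get("unlocked_tier_index", 0), earned)
    return TIERS[stats["unlocked_tier_index"]]
```

The max() call is what makes the index sticky even if the stored counters were ever reset or edited.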

Tier drives two things in the self-play loop:

  1. LLM prompt hint. get_difficulty_hint(cluster) returns a one-liner from DIFFICULTY_HINTS describing the tier's complexity floor. The dream loop injects this into the challenge-generation prompt when the LLM path is used.
  2. Deterministic template scaling. get_difficulty_tier(cluster) is piped into try_template / pick_random_template as a resolver, and each template scales its problem size (1× / 2× / 3× / 4×) and activates its hard-mode twist (NA rows, malformed lines, stopwords, NULL columns, extra log levels, expert concurrency variants) when tier is advanced or expert.
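The template-side scaling in item 2 might look roughly like this. The mapping of 1× to 4× onto the four tiers in order is an assumption, and scale_template, base_rows, and the returned keys are hypothetical names used only for illustration.

```python
SIZE_MULTIPLIER = {"basic": 1, "intermediate": 2, "advanced": 3, "expert": 4}
HARD_MODE_TIERS = {"advanced", "expert"}


def scale_template(base_rows: int, tier: str) -> dict:
    """Derive render parameters for a deterministic template from the tier."""
    return {
        "rows": base_rows * SIZE_MULTIPLIER[tier],  # problem size grows with tier
        "hard_mode": tier in HARD_MODE_TIERS,  # twist only at advanced / expert
    }
```

A template would then switch on hard_mode to add its own twist, such as NA rows or malformed lines.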

Before this wiring, the tier was purely cosmetic: the template bank always rendered at basic and the agent 1-shot every cluster in production. Tier is now the mechanism that turns a first-try win at basic into a harder challenge at the next cycle, so a cluster can accumulate real mastery signal rather than saturating on a fixed shape.

Frontier-aware cluster selection (PRM-weighted)

Brittleness scoring sees outcomes but not coverage: a cluster the agent has barely tried looks identical to a cluster it solves first-try, because both have no recent failures. pick_frontier_seed is the extension that splits these by combining two complementary signals computed in the pure-function layer at core/frontier_selection.py:

The two signals, PRM uncertainty and trajectory rarity, are multiplied: a cluster needs both "we don't know much about it" AND "we don't have many examples" to win. Saturated clusters (per list_saturated_clusters()) are excluded with weight 0. pick_weighted then samples among the remaining clusters in proportion to their weights.

Three transparent fallbacks restore the legacy pick_seed behaviour without behavioural drift:

  1. Empty signals. The caller passed no uncertainty / no rarity (PRM untrained AND trajectory store empty). Preserves existing behaviour at cold-boot.
  2. Uniform-sample sanity floor. With probability uniform_sample_prob (default 0.2, exposed as --frontier-uniform-sample-prob), the picker bypasses frontier weighting and calls pick_seed directly. The PRM is itself learned from trajectories the self-play loop produces; without this floor a cold bias could self-reinforce onto a single cluster.
  3. All weights zero. Every non-saturated cluster has at least one signal at 0 (e.g., NaN crept in), or every candidate was excluded. Falls back rather than picking nothing.
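The weighting and the fallback ladder above can be sketched together. This is an illustrative stand-in, not the real pick_frontier_seed: it returns a (cluster_key, fallback_reason) pair instead of the full seed dict, treats either signal being empty as "empty signals", only considers clusters present in both signals, and uses "empty_signals" as a hypothetical reason label. A None key means "defer to pick_seed".

```python
import math
import random


def pick_frontier(uncertainty, rarity, saturated, uniform_sample_prob=0.2, rng=None):
    """Return (cluster_key, None) on a weighted pick, or (None, reason) on fallback."""
    rng = rng or random.Random()
    if not uncertainty or not rarity:
        return None, "empty_signals"  # fallback 1 (label is hypothetical)
    if rng.random() < uniform_sample_prob:
        return None, "uniform_sample"  # fallback 2: sanity floor
    weights = {}
    for key in uncertainty.keys() & rarity.keys():
        if key in saturated:
            continue  # saturated clusters get weight 0
        weight = uncertainty[key] * rarity[key]
        if math.isfinite(weight) and weight > 0:
            weights[key] = weight  # NaN, inf, and non-positive weights are dropped
    if not weights:
        return None, "no_positive_weight"  # fallback 3
    keys = sorted(weights)  # stable order so a seeded rng is reproducible
    picked = rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]
    return picked, None
```

Dropping non-finite weights before sampling means a single NaN removes only its own cluster, and the all-zero fallback fires only when nothing usable remains.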

Returned dict mirrors pick_seed's shape so call sites need no schema branching, plus extra fields for inspection:

{
  "mode": "frontier_weighted",
  "cluster_key": "...",
  "difficulty_tier": "...",
  "saturated_clusters": [...],
  "weight": float,          # combined uncertainty × rarity
  "uncertainty": float,
  "rarity": float,
  "hint": "FRONTIER TARGET (PRM-weighted): ..."
}

The hint string always begins with FRONTIER TARGET (PRM-weighted) so logs can distinguish it from the brittle-pool path. Fallback seeds carry an extra frontier_fallback key ("uniform_sample" or "no_positive_weight") for log attribution.

Gate conditions in Dreamer.synthetic_self_play use isinstance() rather than truthiness because MagicMock-backed test contexts return Mocks for any attribute access, and those Mocks are truthy. A type check fails closed for both ctx.prm_scorer and ctx.trajectory_collector, so the frontier path stays off unless real instances are present.
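The difference between the two gate styles is easy to reproduce. In the sketch below, PRMScorer and TrajectoryCollector are empty stand-in classes and frontier_gate_open is a hypothetical helper; only the MagicMock behaviour is real.

```python
from unittest.mock import MagicMock


class PRMScorer:  # stand-in for the real scorer class
    pass


class TrajectoryCollector:  # stand-in for the real collector class
    pass


def frontier_gate_open(ctx) -> bool:
    """Open the frontier path only when both attributes are real instances."""
    return isinstance(getattr(ctx, "prm_scorer", None), PRMScorer) and isinstance(
        getattr(ctx, "trajectory_collector", None), TrajectoryCollector
    )


mock_ctx = MagicMock()
truthy_gate = bool(mock_ctx.prm_scorer)  # auto-created Mock attribute is truthy
typed_gate = frontier_gate_open(mock_ctx)  # a Mock is not a PRMScorer

real_ctx = MagicMock()
real_ctx.prm_scorer = PRMScorer()
real_ctx.trajectory_collector = TrajectoryCollector()
real_gate = frontier_gate_open(real_ctx)
```

A truthiness gate would send every mocked test context down the frontier path; the type check keeps it closed until a test wires in real objects.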

Methods

Method | Purpose
get_cluster_stats(key) | Stats blob.
get_brittle_clusters(limit) | Top brittle clusters by score (saturated clusters filtered out).
get_difficulty_tier(key) | basic / intermediate / advanced / expert.
get_difficulty_hint(key) | Tier-specific prompt hint.
_cluster_is_saturated(stats) → bool (classmethod) | Check if last N runs are all trivial first-try wins.
list_saturated_clusters() → list | All currently-saturated cluster keys.
pick_seed(random_explore_prob=0.2) | Choose next cluster + hint for self-play (brittle-pool path). Returns exploration mode with saturated_clusters when all brittle candidates are saturated. Callers typically pass random_explore_prob=0.35 (dream loop default) so even non-saturated picks get breathing room.
pick_frontier_seed(uncertainty_by_cluster, rarity_by_cluster, uniform_sample_prob=0.2, random_explore_prob=0.35) | Frontier-weighted alternative. Combines PRM uncertainty + trajectory rarity, excludes saturated clusters, falls back to pick_seed on empty signals / sanity-sample roll / all-zero weights. Returns the same dict shape as pick_seed with mode="frontier_weighted".
record_run(cluster_key, challenge, attempts_used, passed, description_length, mistake) → dict | Log a run; returns {compression_delta, mastered, is_new_cluster, …}.
adaptive_cooldown(base, floor, ceiling, cluster_key) | Cooldown seconds based on recent progress. Used by both the biological watchdog (minutes–hours) and self_play_loop (5–180 s clamped).

Tests

Frontier-tracker invariants are pinned by tests/test_frontier_tracker.py; the new frontier-weighted picker is covered by tests/test_frontier_pick_frontier_seed.py (fallback ladder + dict-shape contract) and the dream integration by tests/test_dream_frontier_weighted.py (real PRM + real TrajectoryCollector, mocked LLM/sandbox).