core/challenge_templates.py

Fast deterministic challenge library used as a drop-in replacement for slow LLM-generated self-play challenges.

Why

Asking the LLM to invent a self-contained challenge takes 120–150 s per round trip, and ~30% of the results fail validation. The template library produces deterministic, validator-clean challenges in milliseconds. The dream loop falls back to LLM generation in two cases: when no template matches the seeded cluster, and, under saturation, on the 80% branch of a 20/80 coin flip. The minority 20% branch still picks a non-saturated template, so the expert concurrency / algo / regex-parse shapes keep rotating.

Clusters

| Cluster | Template(s) | Basic shape | Advanced+ twist |
| --- | --- | --- | --- |
| data_analysis | _data_analysis_csv_aggregation | CSV: filter by date, group, sum, sort. | Injects ~15% rows with value="NA"; solver must skip them before summing. |
| regex_parse | _regex_parse_access_log | Count 5xx responses per IP from an access log. | Injects ~15% malformed lines (truncated, missing status); solver must skip without crashing. |
| python_general | _python_general_word_frequency | Top-N word frequency. | Injects a stopword set (the/and/or/to/a) at 40% frequency; solver must exclude them from the top-N. |
| algo | _algo_kth_largest | k-th largest integer from a file. | Switches to k-th largest distinct value; collapses duplicates, returns NONE when k exceeds the distinct count. |
| sql | _sql_group_by_aggregation | SQLite sales aggregation. | Drops NOT NULL on amount and inserts ~15% NULL rows; solver must filter with WHERE amount IS NOT NULL. |
| bash | _bash_filter_and_count | Count ERROR / WARN across log files. | Adds a third log level FATAL; output shape grows from 2 to 3 lines. |
| concurrency | _concurrency_router → 8 variants | Tier-gated variant pool (see below). | |

Difficulty tiers

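Every template accepts a tier= keyword argument (basic | intermediate | advanced | expert). The tier drives two things: the size of the generated inputs, scaled through _size(base, tier), and whether the hard-mode twist is injected, gated by _is_hard_mode(tier) (both listed under API).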

Tier is resolved per cluster by FrontierTracker.get_difficulty_tier, which stores a monotonic unlocked_tier_index that advances every TIER_UNLOCK_THRESHOLD first-try wins. The dream loop pipes this resolver into try_template and pick_random_template; unknown or None tiers fall back to basic (preserving behaviour for callers that don't know the frontier state).
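
The monotonic unlock described above can be sketched roughly as follows. This is an illustration only, not the FrontierTracker implementation: the class FrontierSketch, the method record_first_try_win, the helper normalize_tier, and the threshold value of 3 are all assumptions made for the example. Only the names get_difficulty_tier, unlocked_tier_index and TIER_UNLOCK_THRESHOLD come from the text.

```python
from typing import Dict, Optional

TIERS = ("basic", "intermediate", "advanced", "expert")
TIER_UNLOCK_THRESHOLD = 3  # assumed value; the real constant is not stated here


class FrontierSketch:
    """Hypothetical stand-in for FrontierTracker's per-cluster tier bookkeeping."""

    def __init__(self) -> None:
        self._first_try_wins: Dict[str, int] = {}
        self._unlocked_tier_index: Dict[str, int] = {}

    def record_first_try_win(self, cluster_key: str) -> None:
        wins = self._first_try_wins.get(cluster_key, 0) + 1
        self._first_try_wins[cluster_key] = wins
        if wins % TIER_UNLOCK_THRESHOLD == 0:
            current = self._unlocked_tier_index.get(cluster_key, 0)
            # Monotonic: the index only moves up, capped at the last tier.
            self._unlocked_tier_index[cluster_key] = min(current + 1, len(TIERS) - 1)

    def get_difficulty_tier(self, cluster_key: str) -> str:
        return TIERS[self._unlocked_tier_index.get(cluster_key, 0)]


def normalize_tier(tier: Optional[str]) -> str:
    """Unknown or None tiers fall back to basic."""
    return tier if tier in TIERS else "basic"
```

In this sketch a cluster that has never been seen resolves to basic, which matches the fallback for callers that do not know the frontier state.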

Concurrency variant bank

The concurrency entry in TEMPLATES dispatches through _concurrency_router(tier=...). When a tier is known, the router draws from _CONCURRENCY_POOLS_BY_TIER; without a tier, it falls back to uniform sampling over the full 8-variant bank.

| Tier | Pool |
| --- | --- |
| basic | parallel_sum, parallel_max_with_source |
| intermediate | basic pool + shared_counter, bounded_pool |
| advanced | shared_counter, bounded_pool, first_hit_racer, ordered_parallel_map |
| expert | first_hit_racer, ordered_parallel_map, producer_consumer_exact_once, cancel_losers |

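Pools slide upward with difficulty rather than only growing: intermediate extends the basic pool, while advanced and expert drop the easier variants. The basic pool is therefore disjoint from the expert pool, so promoting a cluster to expert actually changes the variant distribution.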

Variant behaviour:

API

| Function | Purpose |
| --- | --- |
| try_template(cluster_key, tier=None) → Optional[ChallengeTriple] | Lookup + instantiate for a specific cluster at the given tier. tier=None renders at basic. |
| pick_random_template(exclude_clusters=None, tier_resolver=None) → Optional[ChallengeTriple] | Random pick from the bank (optionally excluding saturated clusters). tier_resolver is a callable cluster_key → tier used after the draw, typically FrontierTracker.get_difficulty_tier. Maintains a module-level _LAST_TEMPLATE_KEY anchor to avoid drawing the same template twice in a row when alternatives exist. |
| reset_template_history() | Clear the recent-template anchor; used by tests for deterministic first-call behaviour. |
| _size(base, tier) | Scale a base count by the tier multiplier. |
| _is_hard_mode(tier) | True for advanced / expert; gates the twist injection inside each template. |
| _invoke_template(fn, tier) | Call fn(tier=...) when accepted, else fn(); preserves zero-arg callers (and zero-arg template monkey-patches in legacy tests). |

Design rules

  1. Each template is a pure function (tier=None) → (challenge_prompt, setup_script, validator_script).
  2. The setup uses a fixed seed for reproducibility; the validator computes expected values directly from the setup artefacts (no hardcoded values).
  3. Setup and validator communicate via the sandbox filesystem only — no shared memory.
  4. Both must be stdlib-only so they run inside an unprovisioned sandbox.
  5. Validators that use unit-suffixed numeric formats (%, $, ms) must strip the suffix before float()/int(), or the validator self-test gate in dream.py will reject the template.
  6. The hard-mode twist must be described in the challenge prompt. Never hide a new requirement — tier escalation is a curriculum, not a trap.
  7. Validators should be tier-agnostic when possible: a single validator that ignores NULL / NA / malformed inputs still produces the correct expected output at basic (where none exist) and at advanced (where many do).
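
To make the rules concrete, here is a hypothetical template written to follow them. It is not part of the library: the function example_sum_template, the file names values.txt and answer.txt, the seed, and the convention that the validator exits 0 on success are all assumptions made for this sketch.

```python
from typing import Optional, Tuple

ChallengeTriple = Tuple[str, str, str]


def example_sum_template(tier: Optional[str] = None) -> ChallengeTriple:
    """Hypothetical template: sum the numeric lines of values.txt."""
    hard = tier in ("advanced", "expert")
    prompt = (
        "Read values.txt (one value per line) and write the integer sum "
        "to answer.txt."
    )
    if hard:
        # Rule 6: the twist is announced in the prompt, never hidden.
        prompt += ' Some lines contain "NA"; skip them before summing.'

    # Rules 2 and 4: fixed seed, stdlib only.
    setup = f"""\
import random
random.seed(1234)
rows = [str(random.randint(1, 100)) for _ in range(20)]
if {hard!r}:
    rows = ["NA" if i % 5 == 0 else r for i, r in enumerate(rows)]
with open("values.txt", "w") as fh:
    fh.write("\\n".join(rows) + "\\n")
"""

    # Rules 2, 3 and 7: expected value is recomputed from the setup artefact
    # on disk, and the same validator works whether or not NA rows exist.
    validator = """\
import sys
with open("values.txt") as fh:
    expected = sum(int(line) for line in fh if line.strip() not in ("", "NA"))
with open("answer.txt") as fh:
    got = int(fh.read().strip())
sys.exit(0 if got == expected else 1)
"""
    return prompt, setup, validator
```

Because the validator skips NA lines unconditionally, it needs no tier argument: at basic there is nothing to skip, and at advanced the skipped rows are exactly the injected ones.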

ChallengeTriple = Tuple[str, str, str]