core / challenge_templates.py
Fast deterministic challenge library used as a drop-in replacement for slow LLM-generated self-play challenges.
Why
Asking the LLM to invent a self-contained challenge takes 120–150 s per round trip, and ~30% of the results fail validation. The template library produces deterministic, validator-clean challenges in milliseconds. The dream loop falls back to LLM generation in two cases: when no template matches the seeded cluster, or, under saturation, on the 80%-weighted branch of a 20/80 coin flip. The 20% branch still picks a non-saturated template, so the expert concurrency / algo / regex-parse shapes keep rotating.
Clusters
| Cluster | Template(s) | Basic shape | Advanced+ twist |
|---|---|---|---|
| data_analysis | `_data_analysis_csv_aggregation` | CSV: filter by date, group, sum, sort. | Injects ~15% rows with `value="NA"`; solver must skip them before summing. |
| regex_parse | `_regex_parse_access_log` | Count 5xx responses per IP from an access log. | Injects ~15% malformed lines (truncated, missing status); solver must skip without crashing. |
| python_general | `_python_general_word_frequency` | Top-N word frequency. | Injects a stopword set (the/and/or/to/a) at 40% frequency; solver must exclude them from the top-N. |
| algo | `_algo_kth_largest` | k-th largest integer from a file. | Switches to k-th largest distinct value; collapses duplicates, returns `NONE` when k exceeds the distinct count. |
| sql | `_sql_group_by_aggregation` | SQLite sales aggregation. | Drops `NOT NULL` on `amount` and inserts ~15% NULL rows; solver must filter with `WHERE amount IS NOT NULL`. |
| bash | `_bash_filter_and_count` | Count ERROR / WARN across log files. | Adds a third log level FATAL; output shape grows from 2 to 3 lines. |
| concurrency | `_concurrency_router` → 8 variants | Tier-gated variant pool (see below). | - |
Difficulty tiers
Every template accepts a tier= keyword argument (basic | intermediate | advanced | expert). The tier drives two things:
- Size scaling. Base row / file / token counts are multiplied by `_TIER_SIZE_MULTIPLIER[tier]`: 1×, 2×, 3×, 4× respectively. Bigger fixtures catch solutions that accidentally load the whole file, build O(N²) accumulators, or hardcode indices.
- Hard-mode twist. At `advanced` and `expert`, `_is_hard_mode` returns true and each template adds its cluster-specific twist (see the table above). The twist is always described in the challenge prompt (it's a curriculum step, not a gotcha), but it's enough to stop a solution that 1-shotted the basic shape from passing without changes.
Tier is resolved per cluster by FrontierTracker.get_difficulty_tier, which stores a monotonic unlocked_tier_index that advances every TIER_UNLOCK_THRESHOLD first-try wins. The dream loop pipes this resolver into try_template and pick_random_template; unknown or None tiers fall back to basic (preserving behaviour for callers that don't know the frontier state).
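As a rough sketch of the unlock mechanism described above: the class below is a hypothetical stand-in for `FrontierTracker`, not its real implementation. The threshold value of 3, the `record_first_try_win` method name, and the choice to count cumulative (rather than consecutive) first-try wins are all assumptions.

```python
from collections import defaultdict

TIERS = ("basic", "intermediate", "advanced", "expert")
TIER_UNLOCK_THRESHOLD = 3  # assumed value; the real constant is not stated here


class FrontierTrackerSketch:
    """Hypothetical stand-in for FrontierTracker's per-cluster tier state."""

    def __init__(self) -> None:
        self._first_try_wins = defaultdict(int)
        self._unlocked_tier_index = defaultdict(int)

    def record_first_try_win(self, cluster_key: str) -> None:
        """Count a first-try win; unlock the next tier every THRESHOLD wins."""
        self._first_try_wins[cluster_key] += 1
        if self._first_try_wins[cluster_key] % TIER_UNLOCK_THRESHOLD == 0:
            # Monotonic: the index only ever increases, capped at the top tier.
            current = self._unlocked_tier_index[cluster_key]
            self._unlocked_tier_index[cluster_key] = min(current + 1, len(TIERS) - 1)

    def get_difficulty_tier(self, cluster_key: str) -> str:
        """Return the tier name for a cluster; unseen clusters start at basic."""
        return TIERS[self._unlocked_tier_index[cluster_key]]
```

Because the index never decreases, a later losing streak in this sketch would not demote a cluster.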
Concurrency variant bank
The concurrency entry in TEMPLATES dispatches through _concurrency_router(tier=...). When a tier is known the router draws from _CONCURRENCY_POOLS_BY_TIER; without a tier it falls back to uniform sampling over the full 8-variant bank.
| Tier | Pool |
|---|---|
| basic | parallel_sum, parallel_max_with_source |
| intermediate | basic pool + shared_counter, bounded_pool |
| advanced | shared_counter, bounded_pool, first_hit_racer, ordered_parallel_map |
| expert | first_hit_racer, ordered_parallel_map, producer_consumer_exact_once, cancel_losers |
Pool membership is monotonic in difficulty: the basic pool is disjoint from the expert pool, so promoting a cluster to expert actually changes the variant distribution.
Variant behaviour:
- `_concurrency_parallel_sum`: sum integers from N files concurrently.
- `_concurrency_parallel_max_with_source`: argmax across files; workers must return index+value.
- `_concurrency_shared_counter`: count token occurrences into a shared counter (requires `threading.Lock`).
- `_concurrency_bounded_pool`: process N files with `max_workers=K` / semaphore.
- `_concurrency_first_hit_racer`: race to find a needle; cancel on first hit.
- `_concurrency_producer_consumer_exact_once`: bounded `queue.Queue`; every produced item consumed exactly once; validator reproduces the expected sum from the seeded RNG, so any drop/duplication fails the check.
- `_concurrency_ordered_parallel_map`: parallel processing with output in input order. Traps solutions using `as_completed` without re-sorting.
- `_concurrency_cancel_losers`: winner-takes-all race; losers must be cancelled via `threading.Event` or the script exceeds a wall-clock budget the validator enforces.
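To make the `ordered_parallel_map` trap concrete, here is a solver-side sketch of two ways to keep results in input order. The `process` function and both helper names are hypothetical; the real challenge's per-item work is not specified here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Sequence


def process(item: int) -> int:
    """Placeholder for the per-item work (assumed; squares the input)."""
    return item * item


def ordered_with_map(items: Sequence[int], max_workers: int = 4) -> List[int]:
    """Executor.map yields results in input order, whatever the completion order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, items))


def ordered_with_as_completed(items: Sequence[int], max_workers: int = 4) -> List[int]:
    """as_completed yields in completion order, so each result is placed by its index."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        index_of = {pool.submit(process, item): i for i, item in enumerate(items)}
        results: List[int] = [0] * len(items)
        for future in as_completed(index_of):
            results[index_of[future]] = future.result()
        return results
```

A solution that appends `future.result()` to a list inside the `as_completed` loop, without tracking the index, is the shape this variant is meant to catch.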
API
| Function | Purpose |
|---|---|
| `try_template(cluster_key, tier=None) → Optional[ChallengeTriple]` | Lookup + instantiate for a specific cluster at the given tier. `tier=None` renders at basic. |
| `pick_random_template(exclude_clusters=None, tier_resolver=None) → Optional[ChallengeTriple]` | Random pick from the bank (optionally excluding saturated clusters). `tier_resolver` is a callable `cluster_key → tier` used after the draw, typically `FrontierTracker.get_difficulty_tier`. Maintains a module-level `_LAST_TEMPLATE_KEY` anchor to avoid drawing the same template twice in a row when alternatives exist. |
| `reset_template_history()` | Clear the recent-template anchor; used by tests for deterministic first-call behaviour. |
| `_size(base, tier)` | Scale a base count by the tier multiplier. |
| `_is_hard_mode(tier)` | True for advanced / expert; gates the twist injection inside each template. |
| `_invoke_template(fn, tier)` | Call `fn(tier=...)` when accepted, else `fn()`. Preserves zero-arg callers (and zero-arg template monkey-patches in legacy tests). |
Design rules
- Each template is a pure function `(tier=None) → (challenge_prompt, setup_script, validator_script)`.
- The setup uses a fixed seed for reproducibility; the validator computes expected values directly from the setup artefacts (no hardcoded values).
- Setup and validator communicate via the sandbox filesystem only, with no shared memory.
- Both must be stdlib-only so they run inside an unprovisioned sandbox.
- Validators that use unit-suffixed numeric formats (%, $, ms) must strip the suffix before `float()`/`int()`, or the validator self-test gate in dream.py will reject the template.
- The hard-mode twist must be described in the challenge prompt. Never hide a new requirement: tier escalation is a curriculum, not a trap.
- Validators should be tier-agnostic when possible: a single validator that ignores NULL / NA / malformed inputs still produces the correct expected output at basic (where none exist) and at advanced (where many do).
ChallengeTriple = Tuple[str, str, str]
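The unit-suffix and tier-agnostic rules above can be combined in a small validator-side helper. This is a stdlib-only sketch; `parse_number`, `tier_agnostic_total`, and `_UNIT_AFFIXES` are hypothetical names, and it requires Python 3.9+ for `str.removeprefix` / `str.removesuffix`.

```python
from typing import Iterable, Optional

# Unit markers named in the design rules; "$" is usually a prefix, the others suffixes.
_UNIT_AFFIXES = ("%", "$", "ms")


def parse_number(text: str) -> Optional[float]:
    """Strip a unit affix, then parse; return None for NA, empty, or malformed input."""
    cleaned = text.strip()
    for affix in _UNIT_AFFIXES:
        cleaned = cleaned.removeprefix(affix).removesuffix(affix)
    try:
        return float(cleaned)
    except ValueError:
        return None


def tier_agnostic_total(raw_values: Iterable[str]) -> float:
    """Sum the parseable values, silently skipping anything that does not parse.

    At basic (no NA rows) this is a plain sum; at advanced it skips the injected rows,
    so the same validator logic serves both tiers.
    """
    parsed = (parse_number(value) for value in raw_values)
    return sum(value for value in parsed if value is not None)
```

For example, `tier_agnostic_total(["10", "NA", "20"])` and `tier_agnostic_total(["10", "20"])` give the same total, which is the property the last rule asks for.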