core/challenge_templates.py

Fast deterministic challenge library used as a drop-in replacement for slow LLM-generated self-play challenges.

Why

Asking the LLM to invent a self-contained challenge takes 120–150 s per round trip, and roughly 30% of the results fail validation. The template library produces deterministic, validator-clean challenges in milliseconds. The dream loop falls back to LLM generation in two cases: when no template matches the seeded cluster, or, under saturation, on the 80% branch of a 20/80 coin flip. The remaining 20% branch still picks a non-saturated template, so the expert concurrency / algo / regex-parse shapes keep rotating.

Clusters

| Cluster | Template(s) | Basic shape | Advanced+ twist |
| --- | --- | --- | --- |
| `data_analysis` | `_data_analysis_csv_aggregation` | CSV: filter by date, group, sum, sort. | Injects ~15% rows with `value="NA"`; solver must skip them before summing. |
| `regex_parse` | `_regex_parse_access_log` | Count 5xx responses per IP from an access log. | Injects ~15% malformed lines (truncated, missing status); solver must skip without crashing. |
| `python_general` | `_python_general_word_frequency` | Top-N word frequency. | Injects a stopword set (`the/and/or/to/a`) at 40% frequency; solver must exclude them from the top-N. |
| `algo` | `_algo_kth_largest` | k-th largest integer from a file. | Switches to k-th largest distinct value; collapses duplicates, returns `NONE` when k exceeds the distinct count. |
| `sql` | `_sql_group_by_aggregation` | SQLite sales aggregation. | Drops `NOT NULL` on `amount` and inserts ~15% NULL rows; solver must filter with `WHERE amount IS NOT NULL`. |
| `bash` | `_bash_filter_and_count` | Count ERROR / WARN across log files. | Adds a third log level `FATAL`; output shape grows from 2 to 3 lines. |
| `concurrency` | `_concurrency_router` → 8 variants | Tier-gated variant pool (see below). | n/a |

Difficulty tiers

Every template accepts a tier= keyword argument (basic | intermediate | advanced | expert). The tier drives two things:

Size scaling. Base row / file / token counts are multiplied by _TIER_SIZE_MULTIPLIER[tier]: 1×, 2×, 3×, and 4× for basic, intermediate, advanced, and expert respectively. Bigger fixtures catch solutions that accidentally load the whole file, build O(N²) accumulators, or hardcode indices.
Hard-mode twist. At advanced and expert, _is_hard_mode returns true and each template adds its cluster-specific twist (see the table above). The twist is always described in the challenge prompt (it is a curriculum step, not a gotcha), but it is enough to stop a solution that solved the basic shape in one shot from passing unchanged.

Tier is resolved per cluster by FrontierTracker.get_difficulty_tier, which stores a monotonic unlocked_tier_index that advances every TIER_UNLOCK_THRESHOLD first-try wins. The dream loop pipes this resolver into try_template and pick_random_template; unknown or None tiers fall back to basic (preserving behaviour for callers that don't know the frontier state).
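
A rough sketch of how a monotonic per-cluster tier index could behave is shown below. The class name `FrontierTrackerSketch`, the method `record_first_try_win`, and the threshold value of 3 are all hypothetical; only `get_difficulty_tier`, `unlocked_tier_index`, and `TIER_UNLOCK_THRESHOLD` are names taken from the text.

```python
from collections import defaultdict

TIERS = ("basic", "intermediate", "advanced", "expert")
TIER_UNLOCK_THRESHOLD = 3  # hypothetical value; the real threshold is not stated here


class FrontierTrackerSketch:
    """Illustrative stand-in for FrontierTracker's tier bookkeeping."""

    def __init__(self) -> None:
        self.first_try_wins = defaultdict(int)
        self.unlocked_tier_index = defaultdict(int)

    def record_first_try_win(self, cluster_key: str) -> None:
        # Hypothetical hook: every TIER_UNLOCK_THRESHOLD wins unlock one tier.
        self.first_try_wins[cluster_key] += 1
        if self.first_try_wins[cluster_key] % TIER_UNLOCK_THRESHOLD == 0:
            current = self.unlocked_tier_index[cluster_key]
            # The index only moves up and is capped at the last tier.
            self.unlocked_tier_index[cluster_key] = min(current + 1, len(TIERS) - 1)

    def get_difficulty_tier(self, cluster_key: str) -> str:
        return TIERS[self.unlocked_tier_index[cluster_key]]
```

Because the index never decreases, a cluster that has reached advanced stays there even after a losing streak, which is what "monotonic" means in this context.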

Concurrency variant bank

The concurrency entry in TEMPLATES dispatches through _concurrency_router(tier=...). When a tier is known the router draws from _CONCURRENCY_POOLS_BY_TIER; without a tier it falls back to uniform sampling over the full 8-variant bank.

| Tier | Pool |
| --- | --- |
| `basic` | `parallel_sum`, `parallel_max_with_source` |
| `intermediate` | basic pool + `shared_counter`, `bounded_pool` |
| `advanced` | `shared_counter`, `bounded_pool`, `first_hit_racer`, `ordered_parallel_map` |
| `expert` | `first_hit_racer`, `ordered_parallel_map`, `producer_consumer_exact_once`, `cancel_losers` |

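Pools slide upward with difficulty rather than only growing: intermediate extends the basic pool, advanced drops the two basic variants, and the expert pool shares no variant with the basic pool, so promoting a cluster to expert actually changes the variant distribution.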

Variant behaviour:

_concurrency_parallel_sum — sum integers from N files concurrently.
_concurrency_parallel_max_with_source — argmax across files; workers must return index+value.
_concurrency_shared_counter — count token occurrences into a shared counter (requires threading.Lock).
_concurrency_bounded_pool — process N files with max_workers=K / semaphore.
_concurrency_first_hit_racer — race to find a needle; cancel on first hit.
_concurrency_producer_consumer_exact_once — bounded queue.Queue; every produced item consumed exactly once; validator reproduces the expected sum from the seeded RNG, so any drop/duplication fails the check.
_concurrency_ordered_parallel_map — parallel processing with output in input order. Traps solutions using as_completed without re-sorting.
_concurrency_cancel_losers — winner-takes-all race; losers must be cancelled via threading.Event or the script exceeds a wall-clock budget the validator enforces.
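
To illustrate the trap in the ordered parallel map variant: `as_completed` yields futures in completion order, so a solver has to remember each input's position. The sketch below shows one way a solution could do that; `process` is a placeholder for whatever per-item work a challenge asks for.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List


def process(item: int) -> int:
    """Placeholder per-item work (hypothetical)."""
    return item * item


def ordered_parallel_map(items: List[int], max_workers: int = 4) -> List[int]:
    """Process items concurrently but return results in input order."""
    results: List[int] = [0] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the index of the input it came from.
        index_by_future = {pool.submit(process, item): i
                           for i, item in enumerate(items)}
        for future in as_completed(index_by_future):
            results[index_by_future[future]] = future.result()
    return results
```

Appending `future.result()` to a list inside the `as_completed` loop would instead produce completion order, which is the mistake this variant is designed to expose. `ThreadPoolExecutor.map` is a simpler alternative because it already yields results in input order.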

API

| Function | Purpose |
| --- | --- |
| `try_template(cluster_key, tier=None) → Optional[ChallengeTriple]` | Lookup + instantiate for a specific cluster at the given tier. `tier=None` renders at basic. |
| `pick_random_template(exclude_clusters=None, tier_resolver=None) → Optional[ChallengeTriple]` | Random pick from the bank (optionally excluding saturated clusters). `tier_resolver` is a callable `cluster_key → tier` used after the draw, typically `FrontierTracker.get_difficulty_tier`. Maintains a module-level `_LAST_TEMPLATE_KEY` anchor to avoid drawing the same template twice in a row when alternatives exist. |
| `reset_template_history()` | Clear the recent-template anchor; used by tests for deterministic first-call behaviour. |
| `_size(base, tier)` | Scale a base count by the tier multiplier. |
| `_is_hard_mode(tier)` | True for `advanced` / `expert`; gates the twist injection inside each template. |
| `_invoke_template(fn, tier)` | Call `fn(tier=...)` when accepted, else `fn()`. Preserves zero-arg callers (and zero-arg template monkey-patches in legacy tests). |

Design rules

Each template is a pure function (tier=None) → (challenge_prompt, setup_script, validator_script).
The setup uses a fixed seed for reproducibility; the validator computes expected values directly from the setup artefacts (no hardcoded values).
Setup and validator communicate via the sandbox filesystem only — no shared memory.
The setup and validator scripts must both be stdlib-only so they run inside an unprovisioned sandbox.
Validators that use unit-suffixed numeric formats (%, $, ms) must strip the suffix before float()/int(), or the validator self-test gate in dream.py will reject the template.
The hard-mode twist must be described in the challenge prompt. Never hide a new requirement — tier escalation is a curriculum, not a trap.
Validators should be tier-agnostic when possible: a single validator that ignores NULL / NA / malformed inputs still produces the correct expected output at basic (where none exist) and at advanced (where many do).
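
The last two validator rules can be illustrated with a small sketch. Both helper names are hypothetical, and the set of units handled is limited to the three mentioned above.

```python
import re

# Trailing unit markers named in the rule above: %, $ and ms.
_UNIT_SUFFIX = re.compile(r"\s*(%|\$|ms)\s*$")


def parse_number(text: str) -> float:
    """Hypothetical helper: strip a unit marker before calling float()."""
    cleaned = text.strip().lstrip("$")  # also tolerate a leading currency sign
    cleaned = _UNIT_SUFFIX.sub("", cleaned)
    return float(cleaned)


def expected_sum(raw_values) -> float:
    """Tier-agnostic aggregation: skip anything that is not a number.

    At basic no value is skipped; at advanced the injected NA / NULL
    entries are ignored, so one validator serves both tiers.
    """
    total = 0.0
    for raw in raw_values:
        try:
            total += float(raw)
        except (TypeError, ValueError):
            continue  # None, "NA", empty or malformed field
    return total
```

For example, `parse_number("12.5%")` yields `12.5`, and `expected_sum(["1", "NA", None, "2.5"])` yields `3.5`, the same value the validator would compute if the two bad entries had never been injected.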

Type alias: ChallengeTriple = Tuple[str, str, str], holding (challenge_prompt, setup_script, validator_script).