core / self_play_scoring.py

Correctness-weighted curriculum metric that replaces the naive compression-delta score (which used to reward wrong one-shot solves).

Functions

Function	Description
`count_tool_errors(messages) → int`	Count tool-result messages that indicate tool failures (three-tier discrimination, below).
`correctness_weighted_score(passed, compression_delta, tool_errors, alpha=0.4, beta=0.1) → float`	Final score used by the frontier tracker and adaptive cooldown.

Formula

score = (1.0 + α · compression_delta if passed else 0.0) − β · tool_errors

α = 0.4   # reward for compressing the description after a successful solve
β = 0.1   # penalty per tool error

When passed=False, the base term is 0 — compression is ignored because a failed run's "efficient plan" wasn't correct. The tool-error penalty still applies so a failed run with many errors is ranked below a failed run with few.

Three-tier tool-error discrimination

The old heuristic substring-matched error markers anywhere in any tool content — which produced false positives whenever a fixture legitimately contained ERROR: / WARN / Traceback tokens (a log-counting challenge's log files, for example). The new count_tool_errors uses three tiers:

Data-tool whitelist: tool-result messages whose name is one of {file_system, recall, knowledge_base, web_search, deep_research, fact_check, postgres_admin} are never counted. Their content is retrieved data, not failure output.
Unambiguous patterns (counted anywhere in content): "Traceback (most recent call last)", "SYSTEM ERROR: Tool", "SYSTEM ALERT: You have failed", "CalledProcessError", "TimeoutExpired". These don't collide with fixture text.
Line-start prefixes (counted only when they appear at the start of content, after optional whitespace): Error:, ERROR:, error:, AssertionError, SyntaxError, NameError, IndentationError, ImportError, ModuleNotFoundError. Matches the shape the dispatch layer synthesises for genuine tool failures.

The old heuristic would score a legitimate log-counting solve as tool_errors=6 (5 file reads + 1 execute whose stdout mentions ERROR); the new one correctly scores it as 0.

Why

The previous metric rewarded compression even when the agent solved the wrong problem. This formulation makes correctness mandatory and penalises noisy tool use — but only actual tool noise, not fixture content — so brittle clusters are surfaced accurately to the FrontierTracker and the adaptive cooldown.