Skill acquisition

Two parallel paths: create_skill writes permanent Python tools; self-play writes retrieval-aware lessons. Both flow into the same utility-ranked retention.

Path A: `create_skill` — permanent Python tools

The LLM emits create_skill(name, description, parameters_schema, python_code, test_payload).
Entry-point normalization (2026-04-24) — python_code is run through sanitize_code() BEFORE anything hits disk. This strips CDATA envelopes that leaked from the XML tool-call parser, decodes corrupt HTML entities, and extracts code from markdown fences — all AST-gated so clean Python is never perturbed. If normalization fails, the LLM gets a specific actionable error (CDATA wrapper, HTML entities, truncated stream, escaped-newline confusion) instead of a generic test failure. See acquired_skills.html.
The normalized skill body is written TWICE: a transient copy to $GHOST_SANDBOX_DIR/test_skill.py for the TDD run, and (on success) the canonical copy to $GHOST_HOME/system/memory/acquired_skills/<name>.py. The canonical storage lives outside the sandbox bind-mount, so a docker volume rm / rm -rf $GHOST_SANDBOX_DIR does not destroy learned tools.
TDD validation — the transient file is run via the execute tool, with test_payload JSON-encoded and passed as sys.argv[1]. The run must report EXIT CODE: 0 and produce non-empty stdout. On failure, registration is rejected and the trace's pretty_log surfaces the one-line cause (e.g. ValueError: invalid literal for int()…), so the operator doesn't need to grep the agent log.
Persisted into skills_registry.json (co-located with the .py file under memory/acquired_skills/) with status="active", usage_count=0, failure_count=0, and a content hash.
The description is embedded into VectorMemory with metadata.type = "acquired_skill".
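A minimal sketch of the normalize-then-validate steps above. It assumes the three normalizations are tried in the listed order (CDATA, HTML entities, markdown fences) and that each result is accepted only if ast.parse succeeds. The helper names (_strip_cdata, _extract_fence, _diagnose, run_tdd) and the error strings are illustrative, not the real sanitize_code(); run_tdd uses a local subprocess as a stand-in for the sandboxed execute tool.

```python
import ast
import html
import json
import re
import subprocess
import sys
from pathlib import Path

_TICKS = "`" * 3  # built indirectly so the pattern never looks like a fence
_CDATA = re.compile(r"^\s*<!\[CDATA\[(.*)\]\]>\s*$", re.DOTALL)
_FENCE = re.compile(_TICKS + r"(?:python|py)?[ \t]*\n(.*?)" + _TICKS, re.DOTALL)


def _parses(code: str) -> bool:
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def _strip_cdata(code: str) -> str:
    m = _CDATA.match(code)
    return m.group(1) if m else code


def _extract_fence(code: str) -> str:
    m = _FENCE.search(code)
    return m.group(1) if m else code


def _diagnose(code: str) -> str:
    # Map the failure to an actionable message instead of a generic error.
    if "<![CDATA[" in code:
        return "CDATA wrapper around python_code; send bare Python"
    if re.search(r"&(?:lt|gt|amp|quot|#\d+);", code):
        return "HTML entities in python_code; use literal characters"
    if "\\n" in code and "\n" not in code.strip():
        return "escaped newlines (\\n) instead of real line breaks"
    return "python_code does not parse; the stream may be truncated"


def sanitize_code(code: str) -> str:
    if _parses(code):
        return code  # AST gate: clean Python is returned untouched
    candidate = code
    for transform in (_strip_cdata, html.unescape, _extract_fence):
        candidate = transform(candidate)
        if _parses(candidate):
            return candidate
    raise ValueError(_diagnose(code))


def run_tdd(code: str, test_payload: dict, sandbox_dir: Path) -> tuple:
    # Local stand-in for the sandboxed execute tool.
    path = sandbox_dir / "test_skill.py"
    path.write_text(code)
    proc = subprocess.run(
        [sys.executable, str(path), json.dumps(test_payload)],
        capture_output=True, text=True, timeout=30,
    )
    if proc.returncode == 0 and proc.stdout.strip():
        return True, ""
    lines = [ln for ln in proc.stderr.splitlines() if ln.strip()]
    return False, lines[-1] if lines else f"exit code {proc.returncode}, empty stdout"
```

Returning the last non-empty stderr line is what lets a trace show a one-line cause such as the ValueError above rather than a full traceback.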

Migration from legacy sandbox storage. If an earlier install left skills under $GHOST_SANDBOX_DIR/acquired_skills/, they are copied into the canonical memory/acquired_skills/ on first AcquiredSkillManager construction when the canonical dir is empty. Idempotent: once the canonical store is populated, subsequent constructions are a no-op. Legacy files are left in place for manual cleanup after the operator verifies the move.
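The migration rule can be sketched as below. It assumes the legacy directory holds the skill .py files plus skills_registry.json; migrate_legacy_skills is a hypothetical name, not the actual AcquiredSkillManager method. Files are copied rather than moved, so the legacy copies stay in place for manual cleanup.

```python
import shutil
from pathlib import Path


def migrate_legacy_skills(legacy_dir: Path, canonical_dir: Path) -> int:
    """Copy legacy skills into the canonical store once; return the number of files copied."""
    canonical_dir.mkdir(parents=True, exist_ok=True)
    if any(canonical_dir.iterdir()):
        return 0  # canonical store already populated: no-op (idempotent)
    if not legacy_dir.is_dir():
        return 0  # nothing to migrate
    copied = 0
    for src in legacy_dir.iterdir():
        # Assumed file set: skill sources plus the registry file.
        if src.is_file() and (src.suffix == ".py" or src.name == "skills_registry.json"):
            shutil.copy2(src, canonical_dir / src.name)  # copy, never move
            copied += 1
    return copied
```

Gating on an empty canonical directory is what makes repeated constructions safe: the second call sees populated storage and returns immediately.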

Routing

On each turn, registry.get_active_tool_definitions(query) queries vector memory for skills semantically related to the user query and injects only those into the LLM-facing tool list. This keeps the prompt short while still giving the LLM access to the entire archive.
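A toy sketch of query-scoped routing. ToyVectorMemory, its bag-of-words "embedding", the min_score threshold, the always-advertised base_tools list and the registry dict layout are all hypothetical stand-ins for VectorMemory and the real registry. Only the shape of the idea follows the text: search by query, keep hits whose metadata type is acquired_skill, and append their tool definitions.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def _embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real system would call an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class ToyVectorMemory:
    items: list = field(default_factory=list)  # (vector, metadata) pairs

    def add(self, text: str, metadata: dict) -> None:
        self.items.append((_embed(text), metadata))

    def search(self, query: str, k: int = 5) -> list:
        q = _embed(query)
        scored = [(_cosine(q, vec), meta) for vec, meta in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def get_active_tool_definitions(query, memory, registry, base_tools, k=5, min_score=0.2):
    tools = list(base_tools)  # assumed: built-in tools are always advertised
    for score, meta in memory.search(query, k):
        if meta.get("type") != "acquired_skill" or score < min_score:
            continue
        entry = registry.get(meta["name"])
        if entry is not None:
            tools.append(entry["tool_def"])
    return tools
```

Because only the top-k relevant hits are appended, prompt size stays roughly constant as the skill archive grows.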

Tool description hardening (2026-04-24). Each advertised skill is labelled [ACQUIRED SKILL — CALL BY NAME], and its description embeds a concrete invocation example using the skill's own name plus an explicit list of forbidden wrap patterns (python -c, import, python3 acquired_skills/X.py). This fixes the 8-turn greece_top_news loop, in which the LLM saw the skill in its tool list but did not realise it was a top-level callable and kept trying to execute wrapper scripts. The registry's skill-runner closure reads the canonical file from memory/acquired_skills/ and passes it as content= to tool_execute, so execution still happens inside the sandbox; only the source-of-truth file lives outside it.
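A sketch of how a hardened description might be assembled. The label and the three forbidden wrap patterns come from the text above (with X replaced by the skill's own name); harden_description and the wording of the invocation and warning lines are assumptions.

```python
FORBIDDEN_WRAPS = ("python -c", "import", "python3 acquired_skills/{name}.py")


def harden_description(name: str, description: str, example_args: dict) -> str:
    # Concrete call example built from the skill's own name and sample arguments.
    args = ", ".join(f"{k}={v!r}" for k, v in example_args.items())
    # Forbidden wrappers, personalised so the LLM sees the exact script path it must not run.
    forbidden = "; ".join(p.format(name=name) for p in FORBIDDEN_WRAPS)
    return (
        f"[ACQUIRED SKILL — CALL BY NAME] {description}\n"
        f"Invoke directly as a tool: {name}({args})\n"
        f"Do NOT wrap it: {forbidden}"
    )
```

Naming the exact wrapper path the model was trying is the point: a generic "call it directly" hint would not rule out the specific detour it kept taking.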

Telemetry

Each invocation calls manager.log_telemetry(name, success):

Successful run: bump usage_count, reset failure_count.
Failure: bump failure_count; after 3 consecutive failures status flips to "degraded".
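The telemetry rule as a small state update. SkillStats is a hypothetical container for the registry fields, and the sketch assumes a later success does not by itself flip a degraded skill back to active (the text above does not say).

```python
from dataclasses import dataclass

DEGRADE_AFTER = 3  # consecutive failures before status flips


@dataclass
class SkillStats:
    status: str = "active"
    usage_count: int = 0
    failure_count: int = 0


def log_telemetry(stats: SkillStats, success: bool) -> SkillStats:
    if success:
        stats.usage_count += 1
        stats.failure_count = 0  # reset, so failure_count counts consecutive failures
    else:
        stats.failure_count += 1
        if stats.failure_count >= DEGRADE_AFTER:
            stats.status = "degraded"
    return stats
```

Resetting on success is what makes the threshold mean "3 in a row" rather than "3 ever".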

Retire

retire_degraded_skills() archives any skill matching failure_count ≥ 3 OR (failure_count ≥ 5 AND usage_count < 10). Retired skills move to acquired_skills/retired/ and are removed from both the active registry and the vector store so they no longer appear in semantic routing.
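A sketch of the retirement pass, with in-memory dicts standing in for skills_registry.json and the vector store; the function signature is hypothetical. The retirement condition is transcribed literally from the text above.

```python
import shutil
from pathlib import Path


def should_retire(failure_count: int, usage_count: int) -> bool:
    # Transcribed literally; as written, the second clause is implied by the first.
    return failure_count >= 3 or (failure_count >= 5 and usage_count < 10)


def retire_degraded_skills(skills_dir: Path, registry: dict, vector_store: dict) -> list:
    retired_dir = skills_dir / "retired"
    retired = []
    for name, entry in list(registry.items()):  # copy: we mutate registry in the loop
        if not should_retire(entry["failure_count"], entry["usage_count"]):
            continue
        retired_dir.mkdir(exist_ok=True)
        src = skills_dir / f"{name}.py"
        if src.exists():
            shutil.move(str(src), str(retired_dir / src.name))
        del registry[name]            # out of the active registry
        vector_store.pop(name, None)  # out of semantic routing
        retired.append(name)
    return retired
```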

Path B: Self-play lessons

Dream-cycle self-play writes lessons into the SkillMemory playbook, subject to a three-gate extraction pipeline designed to prevent overfit or unhelpful entries.

Extract — Dreamer._extract_structured_lesson runs a meta-cognitive LLM call returning {trigger, anti_pattern, correct_pattern, domains, confidence, source_challenge_hash}. The prompt requires task-class triggers, forbids copying fixture literals, and mandates a non-empty taxonomy-compliant domain set.
Generalization guard — Dreamer._generalization_guard uses token n-gram overlap (_GENERALIZATION_MIN_NGRAM = 6) to reject lessons that restate the challenge or copy-paste constants from the setup / validator. Empty triggers, empty correct-patterns, and off-taxonomy domains are also rejected here.
Verification — Dreamer._verify_lesson_helpful re-runs the solver once with the lesson prepended under the production ### SKILL PLAYBOOK: header. The lesson is marked verified=True only if the outcome strictly improves: either the original run failed and the verification run passes, or the original run needed ≥ 2 attempts and the verification run passes on attempt 1. Verified lessons are pinned in the playbook regardless of retrieval stats.
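A minimal sketch of the generalization guard. It assumes \w+ tokenization and a hypothetical TAXONOMY set; the real tokenizer, field order and rejection messages may differ. A lesson is rejected if any 6-token window of its trigger, anti-pattern or correct-pattern also appears in the challenge text.

```python
import re

_GENERALIZATION_MIN_NGRAM = 6
TAXONOMY = {"python", "web", "data", "math", "shell"}  # hypothetical taxonomy


def _tokens(text: str) -> list:
    return re.findall(r"\w+", text.lower())


def _ngrams(tokens: list, n: int) -> set:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def generalization_guard(lesson: dict, challenge_texts: list) -> tuple:
    if not lesson.get("trigger", "").strip():
        return False, "empty trigger"
    if not lesson.get("correct_pattern", "").strip():
        return False, "empty correct_pattern"
    domains = set(lesson.get("domains", []))
    if not domains or not domains <= TAXONOMY:
        return False, "empty or off-taxonomy domains"
    n = _GENERALIZATION_MIN_NGRAM
    source = set()
    for text in challenge_texts:  # challenge prompt, setup, validator
        source |= _ngrams(_tokens(text), n)
    for name in ("trigger", "anti_pattern", "correct_pattern"):
        if _ngrams(_tokens(lesson.get(name, "")), n) & source:
            return False, f"{name} copies a {n}-token span from the challenge"
    return True, ""
```

Short shared phrases (under six tokens) pass, so a lesson can still mention "CSV" or "revenue" while a copied sentence or fixture literal cannot.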

Retrieval feedback loop

SkillMemory.get_playbook_context(query) — vector-search (distance < 0.45) + BM25-lite re-rank, top 5 injected under the ### SKILL PLAYBOOK: header in both the planner and execution system prompts.
record_retrieval(trigger) fires on every injection and emits a debug log keyed by source + source_challenge_hash, which lets us answer empirically whether a given self-play lesson is ever actually used.
record_helpful_retrieval(trigger) / credit_recent_retrievals(window_seconds=300) bump helpful_retrievals when a turn succeeds shortly after a lesson was surfaced.
prune_low_utility(min_retrievals=5, max_drop_fraction=0.25) drops non-verified lessons that have been surfaced many times but never marked helpful; verified lessons are never pruned.
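The retrieval counters can be sketched as follows. Two behaviours are assumptions: credit_recent_retrievals credits every lesson whose most recent retrieval falls inside the window, and max_drop_fraction caps how many lessons a single prune call may drop, as a fraction of playbook size. Lesson and PlaybookStats are hypothetical names; the clock parameter exists only to make the example deterministic.

```python
import time
from dataclasses import dataclass


@dataclass
class Lesson:
    trigger: str
    verified: bool = False
    retrievals: int = 0
    helpful_retrievals: int = 0
    last_retrieved: float = 0.0


class PlaybookStats:
    def __init__(self, lessons, clock=time.time):
        self.lessons = {lesson.trigger: lesson for lesson in lessons}
        self.clock = clock

    def record_retrieval(self, trigger):
        lesson = self.lessons[trigger]
        lesson.retrievals += 1
        lesson.last_retrieved = self.clock()

    def credit_recent_retrievals(self, window_seconds=300):
        # Called after a successful turn: credit lessons surfaced shortly before it.
        now = self.clock()
        for lesson in self.lessons.values():
            if lesson.retrievals and now - lesson.last_retrieved <= window_seconds:
                lesson.helpful_retrievals += 1

    def prune_low_utility(self, min_retrievals=5, max_drop_fraction=0.25):
        # Only non-verified lessons surfaced often that never helped.
        candidates = [
            lesson for lesson in self.lessons.values()
            if not lesson.verified
            and lesson.retrievals >= min_retrievals
            and lesson.helpful_retrievals == 0
        ]
        candidates.sort(key=lambda lesson: lesson.retrievals, reverse=True)
        limit = int(len(self.lessons) * max_drop_fraction)
        dropped = [lesson.trigger for lesson in candidates[:limit]]
        for trigger in dropped:
            del self.lessons[trigger]
        return dropped
```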

Retention formula

utility = confidence·0.5 + hit_rate·0.8 + (0.3 if verified) + log(freq)·0.1 + stale_penalty
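A worked version of the retention formula. It assumes hit_rate = helpful_retrievals / retrievals, a natural log, freq ≥ 1 and a stale_penalty that is zero or negative; none of these definitions are spelled out above. For example, confidence 0.8, 2 helpful out of 5 retrievals, verified and freq 5 gives 0.4 + 0.32 + 0.3 + 0.1·ln 5 ≈ 1.18.

```python
import math


def utility(confidence, helpful_retrievals, retrievals, verified, freq, stale_penalty=0.0):
    hit_rate = helpful_retrievals / retrievals if retrievals else 0.0  # assumed definition
    return (
        confidence * 0.5
        + hit_rate * 0.8
        + (0.3 if verified else 0.0)
        + math.log(max(freq, 1)) * 0.1  # guard: log(0) is undefined, so freq is floored at 1
        + stale_penalty                 # assumed <= 0 for lessons not seen recently
    )
```

With these weights, observed helpfulness (0.8) outweighs the extractor's own confidence (0.5), and frequency only contributes logarithmically.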

Playbook capped at PLAYBOOK_MAX = 50. When trimming, entries are retained in priority order: head pin → verified pins → highest-utility unverified.
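The trim step can be sketched as a priority concatenation followed by a cut at PLAYBOOK_MAX. The entry schema (pinned_head, verified and utility keys) is hypothetical, and so is modelling the head pin as a flag.

```python
PLAYBOOK_MAX = 50


def trim_playbook(entries: list, max_size: int = PLAYBOOK_MAX) -> list:
    head = [e for e in entries if e.get("pinned_head")]
    verified = [e for e in entries if e.get("verified") and not e.get("pinned_head")]
    rest = sorted(
        (e for e in entries if not e.get("verified") and not e.get("pinned_head")),
        key=lambda e: e["utility"],
        reverse=True,  # best unverified lessons survive first
    )
    return (head + verified + rest)[:max_size]
```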

User-facing surface

The list_lessons(scope, limit) tool exposes the playbook directly. Phrases like "what did you learn today?" / "what have you learned so far?" are routed to this tool via SYSTEM_PROMPT. Read-only surface tools (list_lessons, recall, manage_skills) discharge the meta-task compliance nudge so the agent doesn't loop writing redundant no-op skills just to satisfy the "did you call learn_skill?" check.

Composed skills

Recurring tool-call sequences are pulled into ComposedSkill macros. A composed skill defines steps with parameter templates and conditional branches, executed via execute(skill_name, executor_fn, params). Up to 50 composed skills are kept; the least-used is evicted on overflow.
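A sketch of a composed-skill store with templated arguments, conditional steps and least-used eviction. Step, ComposedSkill and ComposedSkillStore are hypothetical. The sketch assumes templates use str.format placeholders, a condition sees the call parameters plus the previous step's result under last_result, and usage_count is the eviction key.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

MAX_COMPOSED = 50


@dataclass
class Step:
    tool: str
    args_template: dict                                 # str values may hold {param} placeholders
    condition: Optional[Callable[[dict], bool]] = None  # step runs only if this returns True


@dataclass
class ComposedSkill:
    name: str
    steps: list = field(default_factory=list)
    usage_count: int = 0


class ComposedSkillStore:
    def __init__(self, max_skills: int = MAX_COMPOSED):
        self.skills = {}
        self.max_skills = max_skills

    def add(self, skill: ComposedSkill) -> None:
        if skill.name not in self.skills and len(self.skills) >= self.max_skills:
            least_used = min(self.skills.values(), key=lambda s: s.usage_count)
            del self.skills[least_used.name]  # evict on overflow
        self.skills[skill.name] = skill

    def execute(self, skill_name: str, executor_fn, params: dict) -> list:
        skill = self.skills[skill_name]
        skill.usage_count += 1
        context = dict(params)
        results = []
        for step in skill.steps:
            if step.condition is not None and not step.condition(context):
                continue  # conditional branch not taken
            args = {key: value.format(**context) if isinstance(value, str) else value
                    for key, value in step.args_template.items()}
            result = executor_fn(step.tool, args)
            context["last_result"] = result
            results.append(result)
        return results
```

Usage: a fetch-then-save macro can branch to a notify step when the fetch result looks like a failure, all inside one execute call.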