memory / skills.py — SkillMemory

Lessons learned from past mistakes. Vector-backed retrieval with BM25 re-rank; utility-driven retention; verified-first pinning.

Storage

JSON at memory_dir/skills_playbook.json.
Vector embeddings stored alongside in VectorMemory with metadata.type = "skill".
threading.RLock for JSON reads/writes; lazy-init fallback for tests.

Lesson schema (v2)

{
  "schema_version":      2,
  "timestamp":           "ISO",
  "task":                str,
  "trigger":             str,
  "mistake":             str,
  "anti_pattern":        str,
  "solution":            str,
  "correct_pattern":     str,
  "code_example":        str,
  "domains":             list[str],
  "confidence":          float (0-1),
  "source_challenge_hash": str,
  "verified":            bool,
  "verification_attempted": bool,
  "retrievals":          int,
  "helpful_retrievals":  int,
  "last_retrieved_at":   "ISO",
  "source":              str,
  "frequency":           int,
  "graduated":           bool
}

Retention & eviction

PLAYBOOK_MAX = 50. Trim policy:

Newest lesson is always kept (head pin).
Verified lessons pinned next.
Remaining slots filled by unverified, ranked by utility.

Utility formula

utility = confidence·0.5 + hit_rate·0.8 + (0.3 if verified) + log(freq)·0.1 + stale_penalty

Retrieval (three-tier, picked by `query` + `memory_system` presence)

get_playbook_context(query, memory_system, distance_threshold=0.45, limit, record_retrievals) dispatches to one of three paths:

Vector path — memory_system AND query both present. Runs a Chroma query over type="skill" docs, tightens by DEFAULT_RETRIEVAL_DISTANCE = 0.45, then re-ranks with a BM25-lite keyword-overlap bonus. Best path; used whenever the bus wires a vector tier.
BM25 fallback — query is present but memory_system is None (vector store offline, init failed, or a bus is wired without a vector tier). Does pure token-overlap against each lesson's trigger. Returns "" if no lesson shares tokens — never falls through to recency.
Recency fallback — no query supplied (system-prompt-injection usage). Returns the N most recent lessons regardless of content, under the ## RECENT LESSONS header.

Before the fix, the recency fallback fired whenever the vector path was unavailable — including on a real query. A question like "what's the capital of France?" would surface an unrelated Python-syntax lesson just because it happened to be the most recent entry, polluting context with junk. The new three-tier dispatch preserves the recency path ONLY for the no-query case (the SYSTEM_PROMPT cold-inject). Covered by tests/test_skills_bm25_fallback.py.

When a lesson is surfaced, record_retrieval increments the counter and emits a debug log keyed by source + source_challenge_hash — so the "is this self-play lesson ever actually used?" question has data to answer.

Dedup

Vector similarity check (distance < 0.15) or exact task match merges frequency + solution instead of inserting a duplicate.

Listing lessons (user-facing surface)

list_lessons(scope, source, limit) returns lessons filtered by time window and source, most-recent first. Boundaries use local wall-clock:

scope="today" — since local midnight.
scope="week" / "7d" — last 7 days.
scope="all" — entire playbook.
source="self_play" (etc.) — filter by provenance.

Surfaced to the LLM via the list_lessons tool (see tools / memory). Phrases like "what did you learn today?" / "what have you learned so far?" are routed to this tool via SYSTEM_PROMPT.

Public methods

Method	Purpose
`build_lesson(task, trigger, anti_pattern, correct_pattern, domains, confidence, source_challenge_hash, verified, source) → dict`	Factory.
`learn_lesson(task, mistake, solution, memory_system=None, **structured)`	Write lesson with dedup, merge, trim.
`list_lessons(scope="all", source="", limit=20) → list`	Time-window + source filter; local-time boundaries; most-recent first.
`record_retrieval(trigger)` / `record_helpful_retrieval(trigger)`	Hit-rate accounting. `record_retrieval` emits a debug log with `source_challenge_hash` for observability.
`credit_recent_retrievals(window_seconds=300)`	Bulk credit (idempotent per window).
`prune_low_utility(min_retrievals=5, max_drop_fraction=0.25)`	Drop bottom-quartile.
`get_recent_failures(limit) → str`	Format for LLM injection.
`find_by_trigger(trigger)` / `mark_verified` / `remove_by_trigger`	Index helpers used by the verification-grounded lesson flow in dream.py.