memory / skills.py — SkillMemory

Lessons learned from past mistakes. Vector-backed retrieval with BM25 re-rank; utility-driven retention; verified-first pinning.

Storage

Lesson schema (v2)

{
  "schema_version":      2,
  "timestamp":           "ISO",
  "task":                str,
  "trigger":             str,
  "mistake":             str,
  "anti_pattern":        str,
  "solution":            str,
  "correct_pattern":     str,
  "code_example":        str,
  "domains":             list[str],
  "confidence":          float (0-1),
  "source_challenge_hash": str,
  "verified":            bool,
  "verification_attempted": bool,
  "retrievals":          int,
  "helpful_retrievals":  int,
  "last_retrieved_at":   "ISO",
  "source":              str,
  "frequency":           int,
  "graduated":           bool
}

Retention & eviction

PLAYBOOK_MAX = 50. Trim policy:

  1. Newest lesson is always kept (head pin).
  2. Verified lessons pinned next.
  3. Remaining slots filled by unverified, ranked by utility.

Utility formula

utility = confidence·0.5 + hit_rate·0.8 + (0.3 if verified) + log(freq)·0.1 + stale_penalty

Retrieval (three-tier, picked by query + memory_system presence)

get_playbook_context(query, memory_system, distance_threshold=0.45, limit, record_retrievals) dispatches to one of three paths:

  1. Vector pathmemory_system AND query both present. Runs a Chroma query over type="skill" docs, tightens by DEFAULT_RETRIEVAL_DISTANCE = 0.45, then re-ranks with a BM25-lite keyword-overlap bonus. Best path; used whenever the bus wires a vector tier.
  2. BM25 fallbackquery is present but memory_system is None (vector store offline, init failed, or a bus is wired without a vector tier). Does pure token-overlap against each lesson's trigger. Returns "" if no lesson shares tokens — never falls through to recency.
  3. Recency fallback — no query supplied (system-prompt-injection usage). Returns the N most recent lessons regardless of content, under the ## RECENT LESSONS header.

Before the fix, the recency fallback fired whenever the vector path was unavailable — including on a real query. A question like "what's the capital of France?" would surface an unrelated Python-syntax lesson just because it happened to be the most recent entry, polluting context with junk. The new three-tier dispatch preserves the recency path ONLY for the no-query case (the SYSTEM_PROMPT cold-inject). Covered by tests/test_skills_bm25_fallback.py.

When a lesson is surfaced, record_retrieval increments the counter and emits a debug log keyed by source + source_challenge_hash — so the "is this self-play lesson ever actually used?" question has data to answer.

Dedup

Vector similarity check (distance < 0.15) or exact task match merges frequency + solution instead of inserting a duplicate.

Listing lessons (user-facing surface)

list_lessons(scope, source, limit) returns lessons filtered by time window and source, most-recent first. Boundaries use local wall-clock:

Surfaced to the LLM via the list_lessons tool (see tools / memory). Phrases like "what did you learn today?" / "what have you learned so far?" are routed to this tool via SYSTEM_PROMPT.

Public methods

MethodPurpose
build_lesson(task, trigger, anti_pattern, correct_pattern, domains, confidence, source_challenge_hash, verified, source) → dictFactory.
learn_lesson(task, mistake, solution, memory_system=None, **structured)Write lesson with dedup, merge, trim.
list_lessons(scope="all", source="", limit=20) → listTime-window + source filter; local-time boundaries; most-recent first.
record_retrieval(trigger) / record_helpful_retrieval(trigger)Hit-rate accounting. record_retrieval emits a debug log with source_challenge_hash for observability.
credit_recent_retrievals(window_seconds=300)Bulk credit (idempotent per window).
prune_low_utility(min_retrievals=5, max_drop_fraction=0.25)Drop bottom-quartile.
get_recent_failures(limit) → strFormat for LLM injection.
find_by_trigger(trigger) / mark_verified / remove_by_triggerIndex helpers used by the verification-grounded lesson flow in dream.py.