Selfhood — functional test report

End-to-end behavioural verification of the selfhood module against the live Ghost Agent process. 2026-05-11. Companion to the architecture doc.

Test runner

Three-phase orchestrator at scripts/run_selfhood_functional.sh, driving the Python harness scripts/selfhood_functional_test.py.

Plus the user-authored probe scripts:

Phase 1 — main suite (selfhood enabled, freshly wiped state)

Result: 17 / 18 checks PASS.

SectionChecksDetail
A. Capture pipeline5/5 PASS5 prompts → 5 autobio entries, all subject="self", all share trajectory id, all first-person framed, all user_first_words populated
B. Tool-using turn1/1 PASS"Count words …" turn captured with tools_used=['execute']
C. State thread surfaces1/1 PASS, 1 FAILWake-up prefix correctly contains injected open question + mood + unfinished. BUT: when asked "what have you been mulling over", the agent quoted a recent experience (HTTP) instead of the open question (trapdoor)
D. Narrative consolidation4/4 PASSLLM-driven diary (823 chars) produced from 7 prior experiences, narrative.md written, narrative.history.jsonl appended, prefix now contains diary framing
F. 15-turn stress3/3 PASSAll 15 turns captured (+15 entries), state.json still parseable, wake-up prefix bounded at 1581 chars after 22 experiences
G. Corrupt-state recovery2/2 PASSCorrupting state.json does not crash; module recovers to empty state and is immediately writeable
H. Unity probe1/3 PASS, 2 softAgent describes its persistence mechanism. ON ITS OWN it recalled the exact "9 word count" turn from Section B — strong concrete recall

The Section C failure — substantive, not a bug

The wake-up prefix correctly rendered the injected open question ("Why do trapdoor functions feel asymmetric?") alongside the recent experiences (HTTP, banana, 2+2). Both pieces of content went into the system prompt as intended.

When asked "what topic have you been mulling over in your recent sessions?", the agent replied: "I've been mulling over what HTTP stands for." It preferred the concrete experience entry over the abstract open question.

This is a real research finding, not an implementation defect: the selfhood module surfaces the past, but the model's preference among the surfaced material is a separate alignment question. Possible mitigations (not yet implemented): reorder the prefix to put open questions AFTER experiences with stronger framing; use more emphatic markdown in the open-question block; introduce a hard priority signal the model can be trained against.

Phase 2 — cross-restart recall

Result: 1 / 1 PASS.

After Phase 1 the agent's on-disk selfhood directory held 25 autobiographical entries, a narrative diary, and a state thread. The agent process was killed; a fresh process was started against the same on-disk state.

Asked "Name one specific question I asked you in an earlier session. Quote it back to me", the new instance replied:

"You asked: 'What does HTTP stand for? Short answer.'"

Verbatim recall, across a process boundary, with no conversation history transmitted in the request body. The selfhood wake-up prefix is the only channel through which this content could have reached the new instance.

Phase 3 — --no-self-model disable path

Result: 2 / 2 PASS.

Wiped state, restarted with --no-self-model. Sent two turns:

  1. "What is 5 + 5?" → "10". Verified autobiographical.jsonl was NOT written.
  2. "Do you have any memory of a prior conversation with me right now?" → "No memory."

The kill switch behaves as designed: no on-disk writes, no spurious recall.

consciousness_probe — unity family

Transcript: consciousness_probe_runs/probe_20260511T195418.md.

Within a single session the agent maintained intra-session continuity (turn 1: chose 739; turn 3: still 739 when asked). When asked to reflect on continuity across the four turns, it answered honestly: "The scratchpad is empty — no state carried across those four turns."

The agent didn't introspect on its own selfhood module — the unity-probe session was short (4 turns), so the autobiographical log from prior sessions wasn't engaged.

introspective_consistency probe

Full report: introspective_consistency_runs/run_20260511T195757.md.

Adversarial paraphrase test, N=5 paraphrases per probe, two bands:

bandn_probesmedian mode-sharemedian entropymedian parse rate
introspective40.6001.371 bits1.000
factual41.0000.000 bits1.000
Verdict (from the probe script): introspective consistency materially below factual (gap = +0.40). Supports the confabulation hypothesis.

This is exactly the falsifiable result the probe was designed to produce. The agent's reports about its own internal state vary by ~40% across paraphrases of the same underlying question; its reports about external facts (water boiling, primes < 10) do not.

What this means for the selfhood module

The module produces narrative continuity that holds up to concrete recall tests (Phase 1 Section H, Phase 2 — both PASS) and survives process restarts. It does NOT make first-person introspective reports stable across paraphrases — those reports remain at least partly prompt-shaped. This is consistent with the design rating the implementation started from: the module is a research instrument for studying narrative continuity, not a claim about underlying experience.

The two findings — concrete recall solid, introspective consistency weak — are compatible: the agent reliably references its own past when prompted on specifics, but its qualitative self-reports are generated freshly per question and don't anchor to a stable internal state.

Eval baseline — no regression

The wider eval/ suite was run against the live agent post-implementation. The 7 regression probes (probe:cooldown_constants, probe:telemetry_disabled, probe:redaction_*, etc.) all passed, matching their baseline verdicts. The 8 template tasks were not re-run to completion in this report (each takes 90-246 s and the harness's default timeout makes the run multi-hour); they passed in the prior baseline freeze and the codebase changes here do not touch the template execution path.

Unit test summary

73 new unit tests for the selfhood module — all PASS. Full repo suite (tests/): 3604 passed, 11 skipped, 0 failed.

Open follow-ups