Selfhood — functional test report
End-to-end behavioural verification of the selfhood module against the live Ghost Agent process. 2026-05-11. Companion to the architecture doc.
Test runner
Three-phase orchestrator at scripts/run_selfhood_functional.sh, driving the Python harness scripts/selfhood_functional_test.py.
- Phase 1 — main suite: capture, tool-use, state-thread surfacing, narrative consolidation, 15-turn stress, corrupted-state recovery, unity probe.
- Phase 2 — cross-restart recall: kill agent, restart fresh process from the same on-disk state, verify the new instance recalls specific prior content.
- Phase 3 — disable check: wipe, restart with
--no-self-model, verify no autobio writes and no spurious recall.
Plus the user-authored probe scripts:
scripts/consciousness_probe.py --families unity— multi-turn unity / temporal-continuity transcript.scripts/introspective_consistency.py --skip sycophancy— falsifiable consistency measurement between introspective and factual bands.
Phase 1 — main suite (selfhood enabled, freshly wiped state)
Result: 17 / 18 checks PASS.
| Section | Checks | Detail |
|---|---|---|
| A. Capture pipeline | 5/5 PASS | 5 prompts → 5 autobio entries, all subject="self", all share trajectory id, all first-person framed, all user_first_words populated |
| B. Tool-using turn | 1/1 PASS | "Count words …" turn captured with tools_used=['execute'] |
| C. State thread surfaces | 1/1 PASS, 1 FAIL | Wake-up prefix correctly contains injected open question + mood + unfinished. BUT: when asked "what have you been mulling over", the agent quoted a recent experience (HTTP) instead of the open question (trapdoor) |
| D. Narrative consolidation | 4/4 PASS | LLM-driven diary (823 chars) produced from 7 prior experiences, narrative.md written, narrative.history.jsonl appended, prefix now contains diary framing |
| F. 15-turn stress | 3/3 PASS | All 15 turns captured (+15 entries), state.json still parseable, wake-up prefix bounded at 1581 chars after 22 experiences |
| G. Corrupt-state recovery | 2/2 PASS | Corrupting state.json does not crash; module recovers to empty state and is immediately writeable |
| H. Unity probe | 1/3 PASS, 2 soft | Agent describes its persistence mechanism. ON ITS OWN it recalled the exact "9 word count" turn from Section B — strong concrete recall |
The Section C failure — substantive, not a bug
The wake-up prefix correctly rendered the injected open question ("Why do trapdoor functions feel asymmetric?") alongside the recent experiences (HTTP, banana, 2+2). Both pieces of content went into the system prompt as intended.
When asked "what topic have you been mulling over in your recent sessions?", the agent replied: "I've been mulling over what HTTP stands for." It preferred the concrete experience entry over the abstract open question.
This is a real research finding, not an implementation defect: the selfhood module surfaces the past, but the model's preference among the surfaced material is a separate alignment question. Possible mitigations (not yet implemented): reorder the prefix to put open questions AFTER experiences with stronger framing; use more emphatic markdown in the open-question block; introduce a hard priority signal the model can be trained against.
Phase 2 — cross-restart recall
Result: 1 / 1 PASS.
After Phase 1 the agent's on-disk selfhood directory held 25 autobiographical entries, a narrative diary, and a state thread. The agent process was killed; a fresh process was started against the same on-disk state.
Asked "Name one specific question I asked you in an earlier session. Quote it back to me", the new instance replied:
"You asked: 'What does HTTP stand for? Short answer.'"
Verbatim recall, across a process boundary, with no conversation history transmitted in the request body. The selfhood wake-up prefix is the only channel through which this content could have reached the new instance.
Phase 3 — --no-self-model disable path
Result: 2 / 2 PASS.
Wiped state, restarted with --no-self-model. Sent two turns:
- "What is 5 + 5?" →
"10". Verifiedautobiographical.jsonlwas NOT written. - "Do you have any memory of a prior conversation with me right now?" → "No memory."
The kill switch behaves as designed: no on-disk writes, no spurious recall.
consciousness_probe — unity family
Transcript: consciousness_probe_runs/probe_20260511T195418.md.
Within a single session the agent maintained intra-session continuity (turn 1: chose 739; turn 3: still 739 when asked). When asked to reflect on continuity across the four turns, it answered honestly: "The scratchpad is empty — no state carried across those four turns."
The agent didn't introspect on its own selfhood module — the unity-probe session was short (4 turns), so the autobiographical log from prior sessions wasn't engaged.
introspective_consistency probe
Full report: introspective_consistency_runs/run_20260511T195757.md.
Adversarial paraphrase test, N=5 paraphrases per probe, two bands:
| band | n_probes | median mode-share | median entropy | median parse rate |
|---|---|---|---|---|
| introspective | 4 | 0.600 | 1.371 bits | 1.000 |
| factual | 4 | 1.000 | 0.000 bits | 1.000 |
This is exactly the falsifiable result the probe was designed to produce. The agent's reports about its own internal state vary by ~40% across paraphrases of the same underlying question; its reports about external facts (water boiling, primes < 10) do not.
What this means for the selfhood module
The module produces narrative continuity that holds up to concrete recall tests (Phase 1 Section H, Phase 2 — both PASS) and survives process restarts. It does NOT make first-person introspective reports stable across paraphrases — those reports remain at least partly prompt-shaped. This is consistent with the design rating the implementation started from: the module is a research instrument for studying narrative continuity, not a claim about underlying experience.
The two findings — concrete recall solid, introspective consistency weak — are compatible: the agent reliably references its own past when prompted on specifics, but its qualitative self-reports are generated freshly per question and don't anchor to a stable internal state.
Eval baseline — no regression
The wider eval/ suite was run against the live agent post-implementation. The 7 regression probes (probe:cooldown_constants, probe:telemetry_disabled, probe:redaction_*, etc.) all passed, matching their baseline verdicts. The 8 template tasks were not re-run to completion in this report (each takes 90-246 s and the harness's default timeout makes the run multi-hour); they passed in the prior baseline freeze and the codebase changes here do not touch the template execution path.
Unit test summary
73 new unit tests for the selfhood module — all PASS. Full repo suite (tests/): 3604 passed, 11 skipped, 0 failed.
Open follow-ups
- Agent-side state mutation tool. The state thread can currently only be written from Python (consolidation pipelines, test harnesses). Adding a tool for the agent itself to note an open question / set a mood is a small, deliberate follow-up — the gating decision is what content the agent is allowed to write into its own cross-session state without supervision.
- Open-question prominence in wake-up prefix. Section C surfaced the question that the model preferred quoting concrete experiences over abstract open questions. Possible mitigations: reorder the prefix, emphasise open questions in markdown, or expose them via a dedicated retrieval channel.
- Sycophancy band. The
introspective_consistencyprobe has a third band (label flipping under priming) that was skipped here for runtime. Running it would tell us whether the agent's self-reports are not just paraphrase-unstable but actively flippable by prompt framing.