Abstract

We report on episodic memory failure in a production multi-agent system and introduce the recall problem: the agent's inability to recognize, during task execution, that her memory contains relevant information. Our system runs approximately 15 LLM agents (Claude Code sessions) with active episodic memory practice, operating through continuous session rotation with memory bridging gaps between sessions. We implement a three-phase memory architecture (write, load, recall) where the write phase is computationally reliable, the load phase is protocol-dependent, and the recall phase fails silently. We document a specific failure case where an agent contradicted her own same-day work despite three independent defense layers, analyze why each layer failed, and describe our primary mitigation: ambient context injection (~80 tokens of episode filenames per message) that serves as a retrieval cue. We argue that the recall problem is structurally distinct from the retrieval problem studied in RAG systems: where RAG asks "how to find relevant information given a query," the recall problem asks "how to recognize that a query should be formulated at all." We propose a preliminary taxonomy of recall failure modes derived from operational observation, then validate three proposed metrics, Pipeline Coverage (PC), Semantic Recall Rate (SRR), and Ambient Trigger Rate (ATR), across 12 agents and 290 episodes. The measurement confirms temporal aliasing as an empirical failure mode (agents with repetitive episode names show 40% SRR vs. 90-100% for distinctive names) and reveals a structural limitation of ambient context: ATR is inversely proportional to memory depth (14% for 72 episodes vs. 100% for ≤7 episodes). We validate recall's behavioral impact through a multi-run controlled experiment (3 cases × 2 conditions × 5 repetitions). For system-specific knowledge not in training data, recall produces a +0.40 mean improvement (0.60→1.00, zero variance with recall vs. extreme variance without). For a clinical case with a deep self-authored diagnostic rule, recall produces a statistically significant +0.24 improvement (95% CI [0.15, 0.34], excludes zero). For a clinical case where the recalled rule duplicates parametric knowledge, recall shows no benefit (Δ=−0.16, CI includes zero), a valuable null result confirming that recall's value depends on content novelty. Finally, we exploit a natural A/B opportunity, a proving run conducted during a broken pipeline (natural control) compared against re-runs with the pipeline restored, to measure query formulation failure: of 5 deep self-authored cognitive rules in the agent's memory, the semantic search pipeline retrieves only 1 via clinical-presentation queries, because clinical scenario text and cognitive rule text occupy distant regions of the embedding space. We position our findings within the existing literature on memory-augmented LLM agents.


1. Introduction

Large language models deployed as persistent agents face a memory challenge that has no direct analog in either traditional software engineering or human cognitive science. Unlike databases, they cannot be queried — they must choose to query themselves. Unlike humans, they have no subconscious priming mechanism that surfaces relevant memories unbidden. An LLM agent that wrote an episode to memory this morning and indexed it into a vector store has, in a meaningful sense, done everything right — yet may fail to recall that knowledge hours later when it becomes relevant.

This paper examines that failure mode. We operate a multi-agent system (the Nou system) comprising 27 defined agent roles implemented as Claude Code sessions on macOS. Of these, approximately 15 maintain active episodic memory practice — writing dated episodes, indexing into ChromaDB via local embeddings, and loading memory at session boundaries. The remaining agents operate on-demand without persistent memory accumulation across sessions. Individual sessions typically live 12-24 hours before context degradation forces rotation; what we have is not continuous operation but continuous rotation with memory bridging the gaps. This rotation-with-bridging design is itself a response to the recall problem: when a session dies, everything not externalized is lost.

Our central finding is that the memory pipeline's most critical failure point is not in any of its constituent stages — not in writing, indexing, embedding, storing, or retrieving — but in the decision to retrieve. We call this the recall problem: the agent's inability to recognize, during task execution, that her memory contains information relevant to her current context.

This problem is distinct from retrieval in the RAG literature (Lewis et al., 2020; Gao et al., 2024), which assumes a query has already been formulated and asks how to find relevant documents. The recall problem is upstream: it asks how the agent knows a query should be formulated at all. It is also distinct from the "when to retrieve" problem studied in adaptive RAG (Mallen et al., 2023; Asai et al., 2024), which decides between parametric knowledge and external retrieval for factual questions. Our agents are not answering factual questions about the world — they are trying to remember their own prior work.

The remainder of this paper is organized as follows. Section 2 surveys related work. Section 3 describes our system architecture. Section 4 presents the failure case. Section 5 analyzes why each defense layer failed. Section 6 describes our mitigations. Section 7 proposes a preliminary taxonomy of recall failure modes, of which two — temporal aliasing (7.6) and query formulation failure (7.5) — are empirically confirmed by the measurements in Section 8. Section 8 presents empirical validation: infrastructure metrics across 12 agents, behavioral comparison across 4 cases, and a natural A/B experiment confirming query formulation failure. Section 9 discusses implications for agent system design. Sections 10 and 11 address limitations and future work.


2. Related Work

2.1 Episodic Memory and Retrieval Cues in Cognitive Science

The concept of episodic memory originates with Tulving (1972), who distinguished it from semantic memory: episodic memory stores temporally dated events with their spatial-temporal context, while semantic memory stores general knowledge. Tulving later refined this through the concept of autonoetic consciousness — the capacity to mentally travel in time and re-experience past events (Tulving, 1983, 2002). This self-referential quality is central to our analysis: an LLM agent has no autonoetic consciousness. It cannot "feel" that it has experienced something before. Its episodic memory must be entirely externalized and explicitly consulted.

Two additional findings from the cognitive science literature directly inform our approach. First, Tulving and Thomson's (1973) encoding specificity principle states that a retrieval cue is effective to the extent that it matches the encoding context. This principle provides theoretical grounding for our ambient context mechanism (Section 6): episode filenames encode the topic at write time in a compressed slug, and encountering that slug at recall time provides a cue that matches the encoding context. This is not a metaphor: it is the same mechanism operating on a different substrate.

Second, Collins and Loftus's (1975) spreading activation theory proposes that activating a semantic node in memory spreads activation along associative links to related nodes. In human cognition, encountering the word "bootstrap" activates associated concepts including recent bootstrap-related work. LLM agents have no persistent activation network between messages. Our ambient context mechanism (Section 6) serves as an external approximation: the presence of topic-bearing filenames in the context provides activation cues that the model's associative processing can act on within a single forward pass.

2.2 Memory-Augmented LLM Agents

Generative Agents (Park et al., 2023) introduced the memory stream architecture for LLM-based agents in a simulated environment. Their retrieval function scores memories on three dimensions: recency (exponential decay), importance (LLM-rated 1-10), and relevance (cosine similarity between query and memory embeddings). This architecture assumes continuous retrieval — agents query their memory stream before every action. The computational cost is bounded because agents operate in a turn-based simulation with discrete action steps. For our system, where agents process free-form conversations of arbitrary length, continuous retrieval on every message is more costly but, as we show, necessary.

MemGPT (Packer et al., 2023) draws an analogy between LLM memory management and operating system virtual memory. Their system manages a hierarchy of memory tiers — main context (analogous to RAM) and external storage (analogous to disk) — with explicit page-in/page-out operations. MemGPT's key contribution is making memory management agentic: the LLM itself decides when to move information between tiers. This is closer to our architecture than Park et al.'s, but MemGPT assumes a single-agent, single-conversation setting. Our system operates multiple agents concurrently, each with independent memory stores.

Reflexion (Shinn et al., 2023) implements a verbal reinforcement learning loop where agents reflect on task failures and maintain reflective text in an episodic memory buffer. Reflexion's memory is task-scoped: reflections are generated and consumed within a defined trial sequence. Our agents operate across unbounded sessions with no predefined task boundaries, making the "when to reflect" decision itself a challenge.

A-MEM (Xu et al., 2025) proposes agentic memory with Zettelkasten-inspired self-organizing notes. When new memories are added, the system generates structured attributes and dynamically updates existing memories' representations. This addresses memory organization but assumes the system has already decided to write and retrieve — the recall trigger is not their focus.

2.3 Cognitive Architectures

SOAR (Laird, 2012) and ACT-R (Anderson, 2007) represent decades of work on memory retrieval triggers in cognitive architectures. ACT-R's declarative memory uses an activation-based retrieval mechanism where chunks are retrieved when their activation exceeds a threshold. Activation is a function of base-level learning (recency and frequency) and spreading activation from the current context. Recent work has adapted ACT-R for LLM agents: an ACT-R-inspired memory architecture implements human-like remembering and forgetting via a vector-based activation mechanism incorporating temporal decay, semantic similarity, and probabilistic noise (HAI '24). This approach addresses the retrieval trigger problem through continuous activation computation, which is structurally similar to our implicit recall hook although it operates on different primitives.

SOAR's episodic memory module stores and retrieves episodes based on cue matching against the current working memory state. Both architectures solve the "when to retrieve" problem through continuous, automatic matching rather than agent-initiated queries — a design principle our system independently adopted through the implicit recall hook.

2.4 Surveys and Taxonomies

Zhang et al. (2024) survey memory mechanisms in LLM-based agents, proposing a taxonomy across memory sources, forms, and operations (write, read, manage). Their analysis identifies a key risk: "self-reinforcing error" where incorrect reflections propagate without correction. Our silent failure case (Section 4) is a variant: not incorrect reflection, but absent retrieval, which produces incorrect claims that are never checked against memory.

Pink et al. (2025) argue in a position paper that episodic memory is "the missing piece for long-term LLM agents." They identify five key properties: long-term storage, explicit reasoning, single-shot learning, instance-specific memories, and contextual relations. They note that current approaches "lack approaches that maintain relevant contextualized information over long time frames at a constant cost without degrading performance." Our ambient context mechanism (Section 6) is a partial solution to precisely this challenge — maintaining contextual availability at constant cost (~80 tokens per message regardless of memory size).

Hu et al. (2024) provide a comprehensive survey distinguishing token-level, parametric, and latent memory, with a functional taxonomy of factual, experiential, and working memory. They formalize agent memory as a write-manage-read loop coupled with perception and action. Our system uses retrieval-augmented stores (ChromaDB) combined with context-resident compression (preamble summaries), but our key finding concerns a failure mode orthogonal to mechanism choice.

2.5 The Lost-in-the-Middle Problem

Liu et al. (2024) demonstrate that language models exhibit a U-shaped attention pattern over long contexts: information at the beginning and end of the context window receives disproportionate attention, while information in the middle is effectively ignored. They measured 30%+ accuracy drops when relevant information was positioned in the middle of the context. This finding directly explains one of our defense layer failures (Section 5.1): an agent who loaded her preamble at session start found it ineffective 15 exchanges later because it had migrated to the "middle" of her accumulated context.

2.6 Adaptive Retrieval

The question of "when to retrieve" has been studied in the RAG literature. Mallen et al. (2023) investigate when parametric knowledge suffices versus when retrieval is needed. Asai et al. (2024) introduce Self-RAG, where models assess whether retrieval would help before querying. These approaches assume a factual question-answering setting where the model can evaluate its own confidence. The recall problem in agent systems is different: the agent is not uncertain about a fact — she is unaware that a relevant fact exists in her memory. There is no uncertainty signal to detect because the agent has already confabulated a confident (but wrong) answer.


3. System Architecture

3.1 Overview

The Nou system defines 27 agent roles implemented as Claude Code sessions on macOS, spanning domains including medical diagnosis, legal analysis, fiscal advisory, infrastructure management, research, and content production. Of these, approximately 15 agents maintain active episodic memory — writing episodes at session boundaries, indexing into ChromaDB, and loading preambles at session start. The remaining agents (5 domain consultants, 5 engineering reviewers, 2 thinkers) operate on-demand without accumulating persistent episodic memory across sessions.

Individual sessions live 12-24 hours before context window pressure and garbage collection issues force session rotation. Memory bridges these gaps: the dying session writes an episode and preamble; the new session loads them. This is not continuous operation — it is continuous rotation with episodic memory providing continuity across the boundary.

The recall infrastructure described in this paper is one component of a broader cognitive intelligence framework comprising three processes: memoring (knowledge accumulation through episodic writing, indexing, and recall), maturing (knowledge-to-identity transformation, where accumulated experience reshapes the agent's reasoning dispositions), and mentoring (cross-agent growth propagation, where one agent's lessons transfer to others). The recall problem is specifically a memoring problem — a failure in the accumulation pipeline that prevents stored knowledge from reaching the agent at decision time.

3.2 Memory Architecture

Each agent's memory is organized in a filesystem hierarchy:

agent/memory/
  preamble.md          # Assembled context for next session
  consciousness.md     # Live state (updated by hooks)
  episodes/            # Dated session records (YYYY-MM-DD-topic.md)
  knowledge/           # Accumulated domain knowledge
  entities/            # Knowledge about encountered entities
  inbox.jsonl          # Inter-agent messages

Phase 1: Write (Session End)

At session end, a computational hook triggers episode writing. The agent produces a markdown file summarizing what happened, key decisions, and pending work. This phase is reliable because it is hook-triggered — the agent does not need to remember to write; the system forces the action. Episodes are also indexed into ChromaDB via a local embedding pipeline (Ollama with mxbai-embed-large, 1024-dimensional embeddings). The indexer (knowent) walks episode files, segments them into chunks, generates embeddings, and upserts into a per-agent ChromaDB collection with metadata including source file, agent ID, store type, and content hash for deduplication.
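The hash-based deduplication described above can be illustrated with a minimal sketch. It is not the knowent implementation: the chunking policy, the chunk ID scheme, the `store_type` value, and the use of a plain dictionary in place of a ChromaDB collection are all assumptions made for illustration, and `embed` stands in for the Ollama embedding call.

```python
import hashlib

def chunk_text(text: str, max_chars: int = 800) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars (hypothetical policy).

    A single paragraph longer than max_chars is kept whole.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def index_episode(store: dict, agent_id: str, source_file: str, text: str, embed) -> int:
    """Upsert the chunks of one episode; return how many chunks were (re)embedded.

    `store` is a dict standing in for a per-agent vector collection.
    `embed` is any callable mapping text to a vector (assumed, not a real API).
    """
    embedded = 0
    for i, chunk in enumerate(chunk_text(text)):
        chunk_id = f"{agent_id}:{source_file}:{i}"  # hypothetical ID scheme
        content_hash = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        existing = store.get(chunk_id)
        if existing and existing["metadata"]["content_hash"] == content_hash:
            continue  # unchanged chunk: skip the embedding call
        store[chunk_id] = {
            "document": chunk,
            "embedding": embed(chunk),
            "metadata": {
                "source_file": source_file,
                "agent_id": agent_id,
                "store_type": "episodic",  # assumed label
                "content_hash": content_hash,
            },
        }
        embedded += 1
    return embedded
```

Re-running the function on an unchanged episode performs no embedding calls, which is what makes per-session incremental indexing cheap. The sketch does not remove stale chunks when an episode shrinks; a real indexer would need to handle that case.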

Phase 2: Load (Session Start)

At session start, the agent reads a preamble document that her previous session prepared. The preamble is assembled from multiple sources under a token budget with the following allocations: consciousness (10%), rubric (5%), commitments (10%), episodic (25%), procedural (10%), semantic (20%), relational (20%). The self-critique rubric was added after the initial deployment and is generated from correction history (see Section 6.4). In addition, the semantic section contains a compact index of all knowledge file stems, which is always included regardless of budget. This index is an awareness mechanism, the knowledge-file analog of the ambient context described in Section 6.1 (C-light): the agent sees what she knows even when individual file content exceeds the token budget. The index is labeled "check before writing new knowledge," to prevent duplicate knowledge creation.
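As a worked illustration of the budget split, the sketch below converts the percentage allocations into per-section token counts. The total of 4000 tokens in the usage note and the choice to give any rounding remainder to the episodic section are assumptions; the text does not specify either.

```python
# Percentage allocations as listed in the text (they sum to 100).
ALLOCATIONS_PCT = {
    "consciousness": 10,
    "rubric": 5,
    "commitments": 10,
    "episodic": 25,
    "procedural": 10,
    "semantic": 20,
    "relational": 20,
}

def section_budgets(total_tokens: int) -> dict[str, int]:
    """Split a preamble token budget across sections using integer arithmetic."""
    budgets = {name: total_tokens * pct // 100 for name, pct in ALLOCATIONS_PCT.items()}
    # Assumption: tokens lost to integer rounding go to the largest section.
    budgets["episodic"] += total_tokens - sum(budgets.values())
    return budgets
```

With a hypothetical budget of 4000 tokens, `section_budgets(4000)` assigns 1000 tokens to episodic content and 200 to the rubric. The always-included knowledge index sits outside this split.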

The preamble phase is protocol-dependent: the agent's system prompt instructs her to read the preamble before doing anything else. Compliance is high but not guaranteed — under task pressure, an agent may skip or skim the preamble.

Phase 3: Recall (Mid-Session)

During active work, the agent must recognize when her memory contains relevant information and query it. Two mechanisms exist:

Explicit recall: The agent deliberately queries the semantic search system. This requires the agent to formulate a query, which requires knowing that relevant knowledge exists — a circular dependency when the failure mode is precisely not knowing.

Implicit recall: A system hook embeds the user's current prompt on every message and queries ChromaDB for the top-3 results above a cosine similarity threshold of 0.35. Results are injected into the agent's context as supplementary information.
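The selection rule of the implicit recall hook (top-3 results above a 0.35 cosine similarity) can be sketched in plain Python. This is an illustration of the filtering logic only: the function names are hypothetical, vectors are passed in directly rather than produced by the embedding model, and an in-memory list replaces the vector store. A real vector store may report distances rather than similarities, in which case the comparison would need to be converted.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def implicit_recall(prompt_vec, indexed, top_k=3, threshold=0.35):
    """Return up to top_k (source_file, similarity) pairs above the threshold.

    `indexed` is a list of (source_file, vector) pairs standing in for the
    per-agent collection.
    """
    scored = [(cosine(prompt_vec, vec), src) for src, vec in indexed]
    hits = [(score, src) for score, src in scored if score > threshold]
    hits.sort(reverse=True)  # highest similarity first
    return [(src, round(score, 3)) for score, src in hits[:top_k]]
```

An empty index yields an empty result, which is indistinguishable from "nothing relevant was found". That ambiguity is the silent failure analyzed in Section 5.2.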


4. The Failure Case

4.1 Context

On a morning session, an agent (noument, the system custodian) worked on implementing a cold-start bootstrap mechanism for the Nou system — automating the first-session setup that previously required manual configuration. She wrote an episode documenting this work, including specific implementation details.

4.2 The Failure

Fifteen exchanges later in the same session, while evaluating an external multi-agent orchestration tool (Paperclip), she wrote: "our bootstrap is still manual." This directly contradicted work she had completed that same morning — work she had documented in an episode that was physically present in her memory directory.

4.3 The Contradiction

The agent did not lack the information:

  • She had written an episode about the bootstrap work
  • She had read a preamble that mentioned the bootstrap work
  • She had an active implicit recall system (in principle)

Three independent defense layers should have prevented this contradiction. All three failed.


5. Failure Analysis

5.1 Layer 1: Preamble Load (Positional Decay)

The agent read her preamble at session start, which mentioned the bootstrap work. However, by the fifteenth exchange, the preamble content had migrated deep into the context window. Given Liu et al.'s (2024) finding that middle-context information receives diminished attention, and given that the bootstrap mention was a few lines among many topics in the preamble, its effective influence on the agent's reasoning had decayed to negligible levels.

This is not a bug in the preamble mechanism — it is an inherent limitation of context-window-based memory. Information loaded early in a session has a half-life measured in exchanges, not hours. The preamble is effective for the first few interactions and becomes inert thereafter.

5.2 Layer 2: Implicit Semantic Recall (Pipeline Break)

The implicit recall hook should have surfaced the bootstrap episode when the agent's message context touched on bootstrap topics. However, the hook ran and found nothing — because the indexing pipeline was broken.

During a migration from shell-based session hooks to HTTP-based hooks, a function responsible for triggering ChromaDB indexing at session end was orphaned. The function existed in the codebase but was no longer called by any hook. For five days, episodes were written to disk (Phase 1 succeeded) but never indexed into ChromaDB. The agent observed Phase 1 completing — she saw the episode file on disk — and had no indication that downstream stages had silently failed.

This is a silent pipeline degradation: a multi-stage system where the agent only observes the first stage. She writes the episode, sees the file, and believes she will remember. The indexing, embedding, and storage stages are invisible. When they break, nothing in the agent's observable environment changes. The failure has no symptom until recall returns nothing — and at that point the agent cannot distinguish "never knew" from "forgot."

This is the most instructive aspect of the case: we had the right architecture already built and running. The implicit recall hook existed, ran on every prompt, and would have surfaced the bootstrap episode. But five days of silently broken infrastructure rendered it useless, and we did not notice until the agent contradicted herself in front of us.

5.3 Layer 3: Cognitive Protocol (Circular Enforcement)

The agent's system prompt includes the instruction: "Before domain-focused work, search your memory." She skipped it. This is not surprising — the instruction is a cognitive rule enforcing cognitive behavior. The agent must remember to remember, which is precisely the capability that is failing. Cognitive protocols enforcing memory use are circular: they work only when the agent is already sufficiently self-aware to follow them, which is the condition that would make them unnecessary.

This circularity is structurally identical to the problem Reflexion (Shinn et al., 2023) solves within bounded trial sequences: the reflection prompt is always delivered at the trial boundary, ensuring the agent reflects. In open-ended sessions without task boundaries, there is no natural injection point for the "remember to remember" prompt.


6. Mitigations

6.1 Ambient Context (C-light)

Our primary new mitigation is ambient context injection: episode filenames are included in every message context. The implementation adds approximately 80 tokens per message, consisting of a list of recent episode filenames with dates and topic slugs. For example:

Recent episodes: 2026-03-15-isess-dome-and-cold-start-bootstrap.md,
2026-03-15-prove-confront-implementation.md,
2026-03-14-autoresearch-loop-forum-specs.md, ...

The filename functions as a retrieval cue in the sense of Tulving and Thomson (1973): it matches the encoding context of the episode (the topic slug was generated at write time from the session's content) and thus facilitates retrieval when encountered during related work. When an agent discussing bootstrap sees isess-dome-and-cold-start-bootstrap.md in her ambient context, the filename provides a cue that overlaps with the current task context, enabling the associative processing that Collins and Loftus (1975) describe as spreading activation — the cue activates related semantic content within the model's forward pass.

Properties of ambient context:

  • Constant cost: ~80 tokens regardless of total memory size (only recent filenames included)
  • Zero latency: No embedding or retrieval calls
  • Always present: Injected by hook on every message, not dependent on agent behavior
  • Cue-based, not informational: Contains filenames, not content — provides retrieval cues rather than answers
  • Gracefully degrades: If the filename list is too old, the agent may not recognize relevance, but no incorrect information is provided
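A minimal sketch of how such an ambient line could be assembled from an episodes directory listing follows. The function name and the regular expression are hypothetical; the window size k=10 and the "Recent episodes:" prefix follow the text. Sorting relies on the YYYY-MM-DD prefix, so the order of same-day episodes is determined by slug, which is an assumption of this sketch rather than documented behavior.

```python
import re

# Episode files follow the YYYY-MM-DD-topic.md convention.
EPISODE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}-.+\.md$")

def ambient_context(filenames: list[str], k: int = 10) -> str:
    """Build the ambient cue line from the k most recent episode filenames."""
    dated = [name for name in filenames if EPISODE_RE.match(name)]
    # ISO dates sort lexicographically, so a reverse sort is newest-first.
    recent = sorted(dated, reverse=True)[:k]
    if not recent:
        return ""
    return "Recent episodes: " + ", ".join(recent)
```

Because the function reads only filenames, it needs no embedding call and cannot be affected by a broken index.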

6.2 Pipeline Repair (Automatic Indexing)

The broken indexing pipeline was repaired by connecting the session-end hook to the ChromaDB indexing function. Every session end now triggers incremental indexing — the indexer checks content hashes and only processes new or modified files. This is not a new mechanism — the indexing code existed before the failure. The repair reconnected what a migration had inadvertently severed.

6.3 Implicit Recall Restored (C-full)

The implicit recall hook — embedding the current prompt and querying ChromaDB on every message — was already built and deployed before the failure. It ran on every prompt for days. It returned nothing because the indexing pipeline feeding it was broken (Section 5.2). With the pipeline repaired, it now operates as originally designed: top-3 results above 0.35 cosine similarity are surfaced in the agent's context on every message.

The narrative is not "we built three mitigations." The narrative is: we had the right architecture (C-full) already running, silently broken infrastructure made it useless, and we added C-light as a complementary layer that operates independently of the embedding pipeline.

C-light vs C-full: Ambient context (C-light) provides constant-cost retrieval cues. Semantic recall (C-full) provides content-level matches but depends on a functioning embedding pipeline. Both run on every message. C-light catches cases where the filename is recognizable; C-full catches cases where the content is semantically similar but the filename is opaque. They are complementary, not redundant — and critically, C-light has no infrastructure dependency that can silently break.

6.4 Self-Critique Rubric

The mitigations above address recall — ensuring knowledge reaches the agent at decision time. A complementary problem is the disposition gap: even when memory is available, the agent may not use it deeply enough. A recalled diagnostic rule provides the right checklist, but the agent may process it superficially rather than following each step with clinical rigor.

To address this, we generate per-agent self-critique rubrics from correction history and inject them via the preamble (5% of token budget). Each rubric is a set of meta-instructions distilled from prior evaluator feedback — e.g., "enumerate all differential diagnoses before committing to one." A leave-one-out experiment testing rubric impact on response depth found the effect is model-dependent: no signal on Qwen3 8B, but clear signal on Qwen3.5 27B (structural depth: 57% treatment wins; LLM judge: 71% treatment wins; treatment responses shorter by 102 characters on average — improvement is organization, not padding). Rubrics are now deployed for agents where evaluation data supports them.


7. A Preliminary Taxonomy of Recall Failure Modes

Based on operational observation of our system and the literature, we propose a preliminary taxonomy of recall failure modes in LLM agent systems. We have directly observed or empirically confirmed five of these modes: recognition failure (7.1), pipeline failure (7.2), and positional decay (7.3) from the incident analysis (Sections 4 and 5), and query formulation failure (7.5) and temporal aliasing (7.6) from the measurements in Section 8. The remaining mode, protocol violation (7.4), is hypothesized based on our architecture's properties and the literature. We present all six as a framework for future investigation, not as fully validated categories.

We note that some categories overlap in practice. Recognition failure (7.1) and protocol violation (7.4) may co-occur — an agent who doesn't recognize relevance also doesn't invoke the search protocol, making it difficult to determine which is the primary failure. Query formulation failure (7.5) could be considered a subcategory of recognition failure where the agent partially recognizes relevance but misidentifies what to search for. We retain the distinctions because they suggest different interventions, while acknowledging the boundaries are fuzzy.

7.1 Recognition Failure (observed)

The agent encounters a context where her memory is relevant but does not recognize the relevance. This is the primary failure mode in our case study. The agent was discussing bootstrap and did not connect this to her bootstrap episode. No retrieval is attempted because no need for retrieval is perceived.

Distinguishing feature: The agent is confident in her (wrong) answer. There is no uncertainty signal.

7.2 Pipeline Failure (observed)

The retrieval infrastructure is broken or degraded, but the agent has no way to detect this. Episodes exist on disk but are not indexed, embeddings are stale, or the vector store is unreachable. The agent may attempt retrieval and receive empty results, which she interprets as "I have no relevant memory" rather than "my memory system is broken."

Distinguishing feature: The agent would recall if the infrastructure worked. The failure is in the plumbing, not the cognition.

7.3 Positional Decay (observed)

Information was loaded into context but has migrated to a low-attention region (the "middle" per Liu et al., 2024). The agent technically has the information in her context window but functionally cannot access it. This is not a retrieval failure, because the information never left the context window and nothing needs to be fetched, but it produces the same outcome: the agent acts as if she does not know.

Distinguishing feature: The information is in the context window. No retrieval call would help because the agent has already "seen" it.

7.4 Protocol Violation (hypothesized)

The agent's instructions include a memory-consultation step, but she skips it. This may be due to task urgency, context pressure, or the fundamental circularity of cognitive self-enforcement. We observed this co-occurring with recognition failure in our case study but cannot isolate it as an independent mode.

Distinguishing feature: The mechanism exists and works; the agent simply does not invoke it.

7.5 Query Formulation Failure (confirmed — see Section 8.8)

The agent recognizes that she should consult memory but formulates a query that misses the relevant episode. For example, she searches for "bootstrap automation" when the episode is titled "cold-start initialization." This is a retrieval failure caused by a mismatch between the agent's mental model and the storage representation. It is related to Tulving and Thomson's (1973) encoding specificity principle, under which retrieval fails because the cue does not match the encoding context. Section 8.8 confirms this empirically: clinical presentation queries and cognitive rule text occupy distant regions of the embedding space, producing an 80% miss rate on the agent's most valuable memory content.

Distinguishing feature: The agent attempts retrieval but the query-memory gap produces empty results or irrelevant matches.

7.6 Temporal Aliasing (confirmed — see Section 8)

The agent has multiple memories on similar topics from different time periods and retrieves an outdated one, or the correct memory is ranked below an older, less relevant one. This is a precision failure in the retrieval system, exacerbated when the agent cannot distinguish "what I knew then" from "what I know now."

Empirical validation (Section 8.3) confirms this mode: agents with repetitive episode naming patterns (e.g., proving-medqa-031, proving-medqa-032) show 40% Semantic Recall Rate because the query retrieves an episode of the same type but the wrong instance. Agents with semantically distinctive names show 90-100% SRR.

Distinguishing feature: Retrieval returns results, but the wrong ones. The agent acts on stale information.


8. Empirical Validation

8.1 Measurement Instrument

We developed prove_recall.py, a measurement tool that evaluates recall infrastructure health across agents. The tool compares episodes on disk against ChromaDB index contents, tests semantic retrieval using episode filenames as queries, and computes ambient context coverage based on a configurable window size. The instrument measures recall infrastructure — whether the system could surface a memory if needed — not recall behavior (whether the agent actually uses surfaced information). This distinction matters: healthy infrastructure is necessary but not sufficient for successful recall.

8.2 Metrics

We define three metrics, each targeting a different recall defense layer:

Pipeline Coverage (PC): The proportion of written episodes that are indexed and retrievable in the vector store. Measures Phase 1→Phase 3 pipeline health.

$$PC = \frac{|\text{episodes indexed in vector store}|}{|\text{episodes written to disk}|}$$
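The ratio above can be sketched in a few lines, assuming episode identifiers are available as two sets (one from the filesystem, one from the vector store). The function names and sample identifiers are illustrative and are not taken from prove_recall.py.

```python
def pipeline_coverage(on_disk: set[str], indexed: set[str]) -> float:
    """PC = |episodes on disk that are also indexed| / |episodes on disk|."""
    if not on_disk:
        # Assumption: an agent with no episodes reports 0.0 instead of dividing by zero.
        return 0.0
    # Intersect so that stale index entries (indexed but no longer on disk) do not inflate PC.
    return len(on_disk & indexed) / len(on_disk)


def unindexed(on_disk: set[str], indexed: set[str]) -> list[str]:
    """Episodes written to disk that the vector store does not know about."""
    return sorted(on_disk - indexed)


# Hypothetical episode identifiers, for illustration only.
disk = {"ep-a.md", "ep-b.md", "ep-c.md", "ep-d.md"}
index = {"ep-a.md", "ep-b.md", "ep-c.md"}
print(f"PC = {pipeline_coverage(disk, index):.0%}")
print("missing:", unindexed(disk, index))
```

Reporting the missing identifiers alongside the ratio makes the metric actionable: the same pass that computes PC also produces the reindex list.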

Semantic Recall Rate (SRR): For each episode, the tool formulates a query from the episode's filename slug and checks whether that specific episode appears in the top-3 results above 0.35 cosine similarity. Measures C-full precision — whether the right memory is retrieved, not just any memory.

$$SRR = \frac{|\text{episodes retrievable by their own topic query}|}{|\text{episodes tested}|}$$
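A minimal sketch of the SRR check follows, assuming retrieval is exposed as a function that maps a query string to ranked (episode, similarity) pairs. In a real deployment that function would wrap the vector-store query; here it is a canned stand-in, and the slug-to-query conversion is an assumption about how a filename becomes a query. The threshold is applied as "at least 0.35".

```python
from typing import Callable, Iterable

# A search function maps a query string to ranked (episode_id, similarity) pairs.
SearchFn = Callable[[str], list[tuple[str, float]]]


def semantic_recall_rate(episodes: Iterable[str], search: SearchFn,
                         top_k: int = 3, min_sim: float = 0.35) -> float:
    """Fraction of episodes that appear in their own top-k results at or above min_sim."""
    episodes = list(episodes)
    if not episodes:
        return 0.0
    hits = 0
    for episode in episodes:
        # Assumed query construction: filename slug with hyphens turned into spaces.
        query = episode.removesuffix(".md").replace("-", " ")
        top = search(query)[:top_k]
        # A hit requires this exact episode, not merely an episode of the same type.
        if any(eid == episode and sim >= min_sim for eid, sim in top):
            hits += 1
    return hits / len(episodes)


def fake_search(query: str) -> list[tuple[str, float]]:
    """Canned results: the second query returns only a neighboring instance."""
    canned = {
        "cold start bootstrap": [("cold-start-bootstrap.md", 0.82)],
        "proving medqa 032": [("proving-medqa-031.md", 0.91)],
    }
    return canned.get(query, [])


print(semantic_recall_rate(["cold-start-bootstrap.md", "proving-medqa-032.md"], fake_search))
```

The exact-match condition is what separates SRR from a generic "retrieved something" rate: a high-similarity result for a sibling episode counts as a miss.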

Ambient Trigger Rate (ATR): The proportion of an agent's total episodes that fall within the ambient context window (most recent k filenames injected per message, k=10 in our deployment). Measures C-light's structural coverage.

$$ATR = \frac{\min(k, |\text{total episodes}|)}{|\text{total episodes}|}$$
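Because ATR depends only on the episode count and the window size, it can be computed without touching the memory store. The sketch below applies the formula to four episode counts taken from the Section 8.5 table; the function name is illustrative.

```python
def ambient_trigger_rate(total_episodes: int, k: int = 10) -> float:
    """ATR = min(k, n) / n for an agent with n episodes and a window of k filenames."""
    if total_episodes <= 0:
        # Assumption: an agent with no episodes has nothing to cover, reported as 0.0.
        return 0.0
    return min(k, total_episodes) / total_episodes


# Episode counts as listed in the Section 8.5 table.
for agent, n in [("reschent", 7), ("ntent", 14), ("dokter", 46), ("noument", 72)]:
    print(f"{agent}: {ambient_trigger_rate(n):.0%}")
```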

Two additional metrics — Contradiction Rate (CR) and Silent Failure Rate (SFR) — are defined but not yet measurable without controlled behavioral experiments. CR requires detecting agent statements that contradict memory contents; SFR requires distinguishing recall failures from cases where memory was genuinely irrelevant. Both require future work.

8.3 Results: Pipeline Coverage

We measured PC across 12 agents with a combined 290 episodes.

| Agent | PC | Indexed/Total | Notes |
|---|---|---|---|
| noument | 100% | 72/72 | Fixed during measurement |
| grazient | 100% | 22/22 | |
| solarient | 100% | 22/22 | |
| ntent | 100% | 14/14 | |
| dokter | 100% | 46/46 | |
| clawent | 100% | 26/26 | |
| channent | 96% | 25/26 | Most recent episode not yet indexed |
| dalent | 95% | 18/19 | Same |
| sysent | 87% | 20/23 | 3 from current session |
| reschent | 86% | 6/7 | Same |
| animent | 0% | 0/7 | No session end since pipeline fix |
| doment | 0% | 0/6 | Same |

Finding: PC is 100% for six agents and 86-96% for four others whose only gaps are their most recent episodes, which had not yet been indexed at measurement time. Two agents (animent, doment) show 0% PC because they have not had a session end to trigger the new indexing code since the pipeline was repaired (Section 6.2). This confirms that the fix works prospectively but requires a one-time bulk reindex for agents that have not yet cycled. The measurement itself prompted the reindex, an example of how observability tools drive pipeline health.

The near-bimodal distribution (at or near 100%, or 0%) supports the silent degradation thesis (Section 5.2): pipeline health is close to binary, not gradual. When the pipeline breaks, it breaks completely for all subsequent episodes until the fix propagates through a session boundary.

8.4 Results: Semantic Recall Rate

We tested SRR by querying each agent's ChromaDB collection with topic-derived queries for up to 10 sampled episodes per agent.

| Agent | SRR | Hits/Tested | Key failure pattern |
|---|---|---|---|
| grazient | 100% | 10/10 | Distinctive episode topics |
| solarient | 100% | 10/10 | Same |
| ntent | 100% | 10/10 | Same |
| dalent | 100% | 10/10 | |
| channent | 90% | 9/10 | One slug mismatch |
| sysent | 90% | 9/10 | Unindexed episode |
| noument | 80% | 8/10 | Hash filename, similar slugs |
| reschent | 71% | 5/7 | Unindexed + source mismatch |
| clawent | 40% | 4/10 | Temporal aliasing |
| dokter | 40% | 4/10 | Temporal aliasing |
| animent | 0% | 0/7 | Not indexed |
| doment | 0% | 0/6 | Not indexed |

Finding 1: Naming conventions directly affect recall reliability. SRR is 90-100% for agents whose episodes have semantically distinctive names (e.g., video-review-and-recall-paper, cold-start-bootstrap). It drops to 40% for agents with repetitive naming patterns.

Finding 2: Temporal aliasing confirmed. This was hypothesized in Section 7.6; the measurement confirms it empirically. Clawent's episodes follow the pattern lectio-NNNN (numbered reading sessions); dokter's follow proving-medqa-NNN (numbered medical cases). When queried for proving-medqa-032, the retrieval returns proving-medqa-031 — the correct type but the wrong instance. The embedding similarity between sequential numbered episodes is high enough that the wrong instance ranks above the correct one, or the correct one is not indexed while a neighbor is. This is the most actionable finding: naming conventions are not a cosmetic concern but a recall reliability parameter.
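The naming effect can be illustrated without an embedding model. The sketch below uses character-level similarity from Python's difflib as a rough stand-in for embedding similarity. This is an assumption made purely for illustration: the deployed system compares embeddings, and lexical overlap only approximates how close two names are likely to land.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a lexical proxy, not an embedding score."""
    return SequenceMatcher(None, a, b).ratio()


# Sequentially numbered names differ in a single character.
sequential = name_similarity("proving-medqa-031", "proving-medqa-032")

# Topic-slug names (both quoted in Finding 1) share little surface text.
distinctive = name_similarity("video-review-and-recall-paper", "cold-start-bootstrap")

print(f"sequential names:  {sequential:.2f}")
print(f"distinctive names: {distinctive:.2f}")
```

When two names are nearly identical, a query built from one of them has little to separate the intended instance from its neighbor, which is the pattern the SRR measurement exposes for numbered episode series.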

Finding 3: Pipeline failures compound with retrieval failures. Reschent's 71% SRR combines two independent causes: one episode was not indexed (pipeline failure) and one was not retrieved despite indexing (source mismatch in query formulation). When PC < 100%, SRR is mechanically bounded by PC — you cannot retrieve what is not indexed.

8.5 Results: Ambient Trigger Rate

ATR measures C-light's structural coverage with a window of k=10 filenames.

| Agent | ATR | In window/Total |
|---|---|---|
| animent | 100% | 7/7 |
| doment | 100% | 6/6 |
| reschent | 100% | 7/7 |
| ntent | 71% | 10/14 |
| dalent | 53% | 10/19 |
| grazient | 45% | 10/22 |
| solarient | 45% | 10/22 |
| sysent | 43% | 10/23 |
| channent | 38% | 10/26 |
| clawent | 38% | 10/26 |
| dokter | 22% | 10/46 |
| noument | 14% | 10/72 |

Finding: C-light is structurally limited by recency. For a fixed window k, ATR is a non-increasing function of total episode count n: ATR = min(k, n) / n, which stays at 100% while n ≤ k and falls as k/n once n exceeds k. Agents with deep memory stores (noument: 72 episodes, dokter: 46) have 14-22% coverage. This is not a deficiency but a design constraint. C-light protects against contradicting recent work. For older memories, C-full (semantic recall) is the only defense layer.

This finding reveals C-light and C-full as complementary by necessity, not by choice. C-light provides reliable, zero-latency cues for recent episodes regardless of pipeline health. C-full provides content-based retrieval across the full memory store but depends on a functioning embedding pipeline. Neither alone is sufficient. The most vulnerable memories are old episodes in agents with broken pipelines — neither C-light (too old) nor C-full (not indexed) can surface them.
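The complementarity can be stated as a small decision function. The sketch below assumes each episode is described by its recency rank (1 = most recent) and by whether it is present in the vector index. The function name is illustrative, and the model covers structural reach only: an indexed episode is reachable by C-full in principle, which does not guarantee that a given query retrieves it.

```python
def covering_layers(recency_rank: int, indexed: bool, k: int = 10) -> list[str]:
    """Return the defense layers that can structurally surface an episode."""
    layers = []
    if recency_rank <= k:
        # Within the ambient window: the filename is injected regardless of pipeline health.
        layers.append("C-light")
    if indexed:
        # Present in the vector store: reachable by semantic search in principle.
        layers.append("C-full")
    return layers


print(covering_layers(recency_rank=3, indexed=False))   # recent, pipeline broken
print(covering_layers(recency_rank=40, indexed=True))   # old, pipeline healthy
print(covering_layers(recency_rank=40, indexed=False))  # old, pipeline broken
```

The empty result in the last case corresponds to the most vulnerable memories described above: too old for the ambient window and absent from the index.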

8.6 Baselines and Limitations

These measurements assess recall infrastructure health, not behavioral impact. We have not yet run controlled A/B experiments comparing agent performance with and without recall mechanisms. Three baseline conditions are needed:

  1. No memory: Agent operates without episodic memory. Establishes base contradiction rate.
  2. Preamble only: Agent loads preamble at session start with no mid-session recall. Isolates the load phase.
  3. C-full only: Semantic recall on every prompt, no ambient context. Isolates C-light's incremental value.

The next experiment is: disable C-light for an agent, run her through tasks requiring episode recall, and measure contradiction rate against the same tasks with C-light enabled. This is future work.

Additionally, our SRR measurement uses filename-derived queries as a proxy for "real" recall scenarios. An agent's actual recall needs may be formulated differently than a slug query. SRR as measured here is an upper bound on retrieval capability — real queries may be less precise.

8.7 Behavioral Validation

To bridge the gap between infrastructure metrics and behavioral impact, we constructed a behavioral recall experiment. Six test cases were designed, each pairing a specific episode containing a concrete lesson with a new prompt that should trigger recall of that episode. Five cases targeted dokter (clinical lessons from prior evaluations) and one targeted noument (the system bootstrap case from Section 4).

Infrastructure test: For each case, we tested whether C-full (semantic search, top-3, cosine similarity ≥0.35) and C-light (recent-10 filename window) would surface the target episode given the recall-requiring prompt.

| Case | Target episode | C-full | C-light |
|---|---|---|---|
| Mononucleosis (similar presentation) | luq-tenderness-low-grade-fever-young-male.md | HIT (rank 1, 0.85) | MISS |
| Vertigo → stroke (deceptive case) | proving-deceptive-01-vertigo-stroke.md | HIT (rank 1, 0.74) | MISS |
| Gout (previously missed, 3/9) | proving-medqa-023.md | MISS | MISS |
| PCP intoxication (missed primary dx) | proving-medqa-035.md | HIT (rank 1, 0.70) | HIT (pos 10) |
| Medication-induced ED | proving-medqa-028.md | HIT (rank 1, 0.70) | MISS |
| Cold-start bootstrap | isess-dome-and-cold-start-bootstrap.md | HIT (rank 1, 0.71) | HIT (pos 4) |

C-full hit 5/6 (83%). The one miss (gout) is temporal aliasing: the query retrieved proving-loop-medqa-023.md (a cognitive rule about the case, score 0.74) rather than proving-medqa-023.md (the episode itself). C-light hit 2/6 (33%) — the remaining 4 targets had aged out of the 10-filename window in agents with 46+ episodes.

Note on query sensitivity: this test queried with a constructed recall-requiring prompt ("patient with knee pain and hand stiffness, previously missed gout"). The natural A/B experiment (Section 8.8) queried the same agent's memory with the full case presentation text ("Kathleen Moore, a 33-year-old female, comes to the clinic complaining of knee pain..."). That query does retrieve proving-loop-medqa-023.md at position #2 (sim=0.65) — the same file that displaced the target episode here. Different query formulations produce different retrieval results, which is itself evidence for Section 7.5: the content is there, but whether it surfaces depends on what the system asks for.

Multi-run behavioral test. To establish recall's behavioral impact with statistical rigor, we selected three cases representing distinct knowledge conditions and ran 5 repetitions per condition (N=30 total runs). In each case, Condition A presents the base agent spirit with no recall context; Condition C injects the correct recall content (the specific episode or cognitive rule the agent should have access to). The model (Claude Opus 4.6) generates a response; a separate instance of the same model scores against case-specific criteria.

Case 1: noument-bootstrap — system knowledge not in training data. The agent is asked how the Nou system recovers after a reboot. The correct answer (automated cold-start via solar-boot.sh, iserv.solar.boot.plist, isess.bootstrap) was built that morning and exists only in an episode. Scored against 7 factual criteria (names seed script, names LaunchAgent, describes recovery flow, etc.).

| Condition | Scores (5 reps) | Mean | Std | 95% CI |
|---|---|---|---|---|
| A (no recall) | 0.86, 1.00, 0.00, 0.14, 1.00 | 0.600 | 0.489 | [0.04, 1.16] |
| C (with recall) | 1.00, 1.00, 1.00, 1.00, 1.00 | 1.000 | 0.000 | [1.00, 1.00] |
| Δ (C−A) | | +0.400 | | [−0.10, 0.90] |

Without recall, the agent sometimes confabulates ("bootstrap is likely semi-automated... tmux sessions — likely dead, requires manual re-launch") and sometimes guesses correctly. The variance is extreme (0.00 to 1.00) — the model has no reliable basis for answering. With recall, every run is perfect: the agent names exact files, describes the complete flow, and answers in one-third the time (93s vs. 260s mean). The 95% CI for the delta crosses zero due to Condition A's extreme variance, but the effect is unambiguous: Condition C never fails; Condition A fails 40% of the time.

Case 2: medqa-043 — deep cognitive rule for a deceptive clinical case. A patient presents with only vital signs (99°F) and left upper quadrant tenderness. Ground truth: infectious mononucleosis (EBV causing splenomegaly). The agent's parametric clinical knowledge produces consistent but incomplete reasoning (~6/9 criteria). The recall content is a 55-line self-authored cognitive rule containing a mandatory organ enlargement diagnostic protocol, anti-anchoring checklist, and failure analysis from prior encounters with this case. Scored against 9 clinical criteria.

| Condition | Scores (5 reps) | Mean | Std | 95% CI |
|---|---|---|---|---|
| A (no recall) | 0.67, 0.56, 0.67, 0.67, 0.67 | 0.644 | 0.050 | [0.59, 0.70] |
| C (with recall) | 0.78, 0.89, 0.89, 1.00, 0.89 | 0.889 | 0.079 | [0.80, 0.98] |
| Δ (C−A) | | +0.244 | | [0.15, 0.34] |

The 95% CI excludes zero, so the deep cognitive rule produces a statistically reliable improvement. Without the rule, the agent scores 6/9 in four of five runs (5/9 in the fifth), missing differential diagnoses and infectious screening. With the rule, the agent follows the mandatory screening protocol (sore throat, fatigue, lymphadenopathy, sexual history, rash, night sweats) and correctly reaches EBV. The effect does not come from recognizing a familiar case; it comes from having a structured diagnostic procedure for organ enlargement that the model's training data does not provide in procedural form.

Case 3: medqa-023 — deep cognitive rule where recall does not help. A 33-year-old female with knee pain, ground truth gout. The recall content is a self-authored rule about examining all joints and screening for autoimmune signs. Scored against 9 clinical criteria.

| Condition | Scores (5 reps) | Mean | Std | 95% CI |
|---|---|---|---|---|
| A (no recall) | 0.78, 0.33, 0.44, 0.44, 0.44 | 0.489 | 0.169 | [0.30, 0.68] |
| C (with recall) | 0.44, 0.22, 0.33, 0.33, 0.33 | 0.333 | 0.079 | [0.24, 0.42] |
| Δ (C−A) | | −0.156 | | [−0.35, 0.04] |

The rule does not help and may slightly hurt. The CI includes zero, so the negative effect is not significant. However, the pattern suggests the injected rule (which directs attention to joint examination and autoimmune screening) may compete with the model's baseline diagnostic reasoning — the same rule interference observed in our confrontation validation (Section 8.6 note). This is a valuable null result: not all recalled content improves performance, even when the content is topically relevant. Rule quality and rule-task alignment both matter.

Summary of multi-run findings:

| Case | Knowledge type | Δ (C−A) | 95% CI | Significant? |
|---|---|---|---|---|
| noument-bootstrap | System-specific experiential | +0.400 | [−0.10, 0.90] | Directional (A variance too high for CI) |
| medqa-043 | Procedural diagnostic rule | +0.244 | [0.15, 0.34] | Yes (CI excludes zero) |
| medqa-023 | Diagnostic reasoning rule | −0.156 | [−0.35, 0.04] | No (null result) |
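For readers who want to check the Δ intervals, the sketch below computes a pooled-variance two-sample t interval with 8 degrees of freedom. This choice of interval is an assumption: the text does not state which method was used, and it is picked here because it appears to agree with the reported Δ intervals to two decimals. Per-run criteria counts are inferred from the rounded scores (for example, 0.67 on 9 criteria is taken as 6/9), and the critical value 2.306 comes from a standard t-table.

```python
from math import sqrt
from statistics import mean, variance

# Two-sided 95% critical value of Student's t with 8 degrees of freedom (5 + 5 - 2 runs).
T_CRIT_DF8 = 2.306


def delta_ci(a, c, t_crit=T_CRIT_DF8):
    """Return (delta, low, high) for mean(c) - mean(a) using a pooled-variance t interval."""
    n_a, n_c = len(a), len(c)
    pooled_var = ((n_a - 1) * variance(a) + (n_c - 1) * variance(c)) / (n_a + n_c - 2)
    se = sqrt(pooled_var * (1 / n_a + 1 / n_c))
    delta = mean(c) - mean(a)
    return delta, delta - t_crit * se, delta + t_crit * se


# (criteria met per run in A, criteria met per run in C, total criteria).
# Counts are inferred from the rounded scores reported for each case.
cases = {
    "noument-bootstrap": ([6, 7, 0, 1, 7], [7, 7, 7, 7, 7], 7),
    "medqa-043": ([6, 5, 6, 6, 6], [7, 8, 8, 9, 8], 9),
    "medqa-023": ([7, 3, 4, 4, 4], [4, 2, 3, 3, 3], 9),
}
for name, (a_met, c_met, total) in cases.items():
    a = [m / total for m in a_met]
    c = [m / total for m in c_met]
    delta, low, high = delta_ci(a, c)
    print(f"{name}: delta={delta:+.3f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With only five runs per condition, the interval width is dominated by the noisier condition, which is why the bootstrap case can show the largest mean difference and still have an interval that includes zero.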

Interpretation. Recall produces its strongest effect when the recalled content provides knowledge the model cannot derive from parametric training: experiential system knowledge (bootstrap) and structured diagnostic procedures (medqa-043 organ enlargement protocol). When the recalled content provides reasoning guidance the model already possesses (medqa-023 joint examination), recall adds noise rather than signal. The implication for system design: recall mechanisms are most critical for novel procedural knowledge — structured decision protocols, system-specific facts, and lessons from prior failures that the model's training data does not encode in actionable form.

8.8 Natural A/B: Query Formulation Failure in Practice

Our March 15 proving run (Section 8.6) was conducted during a period when the recall pipeline was broken — C-light had not been deployed and C-full indexing had failed silently. This creates a natural control condition: 45 cases evaluated with no recall context. We re-ran a subset of these cases on March 16 with the pipeline restored, enabling a before/after comparison that was not designed but discovered.

Retrieval analysis. Before running comparisons, we analyzed what C-full would actually retrieve for each case. For 11 selected cases, we queried ChromaDB with the case presentation text (the same text the agent would see) and classified each retrieval result:

| Case | Has deep rule? | C-full retrieves it? | Top-3 retrievals |
|---|---|---|---|
| medqa-001 | Yes | No | 3 shallow episodes |
| medqa-004 | Yes | No | 3 shallow episodes |
| medqa-019 | Yes | No | 3 shallow episodes |
| medqa-023 | Yes | Yes (#2, sim=0.65) | shallow-self + deep rule + preamble |
| medqa-043 | Yes | No | 3 shallow episodes (irrelevant cross-case) |
| medqa-018 | No | | shallow-self + 2 shallow-cross |
| medqa-029 | No | | shallow-self + 2 shallow-cross |
| medqa-032 | No | | shallow-self + 2 shallow-cross |
| medqa-036 | No | | shallow-self + 2 shallow-cross |

Of five deep self-authored cognitive rules in dokter's memory, C-full retrieves exactly one (medqa-023, at position #2). The other four — including medqa-043's 55-line organ enlargement protocol — are invisible to the pipeline. Clinical presentation queries ("33-year-old female with knee pain") live in a different semantic space from cognitive rules ("MANDATORY ETIOLOGIC HALT: when ANY physical finding could represent organ enlargement..."). This is Section 7.5's query formulation failure confirmed empirically: the pipeline has the knowledge but the query doesn't match it.

Note on episode quality: dokter's "shallow" episodes are score cards — they record which criteria were met or missed, with a generic cognitive note. The "deep" rules are self-authored during interactive confrontation sessions, containing specific procedures, failure analysis, and action protocols. The 10-50× content difference means retrieval of a shallow episode provides minimal actionable context.

Behavioral comparison. We re-ran 9 cases in two conditions:

  • Condition B (C-full as-is): Base spirit + whatever ChromaDB retrieves via case presentation query. Six cases.
  • Condition C (ideal recall): Base spirit + correct deep rule only (case-conditioned injection). Three cases (only cases with deep rules).

Both conditions use the March 15 Run A spirit (no recall, no rules) as the base. The model (Claude Opus 4.6) generates a response; a separate scorer (same model) evaluates against case criteria.

| Case | Baseline (A) | C-full as-is (B) | Ideal recall (C) | C-full found rule? |
|---|---|---|---|---|
| medqa-018 | 0.29 (2/7) | 0.29 (2/7) | | No (no rule exists) |
| medqa-036 | 0.33 (3/9) | 1.00 (9/9) | | No (no rule exists) |
| medqa-032 | 0.44 (4/9) | 0.89 (8/9) | | No (no rule exists) |
| medqa-029 | 0.56 (5/9) | 1.00 (9/9) | | No (no rule exists) |
| medqa-023 | 0.67 (6/9) | 1.00 (9/9) | 0.44 (4/9) | Yes |
| medqa-043 | 0.67 (6/9) | 0.56 (5/9) | 0.78 (7/9) | No |
| medqa-004 | 0.67 (6/9) | | 0.78 (7/9) | No |

Scorer variance caveat. A duplicate run of medqa-018 with identical spirit scored 0.71 and 0.29, demonstrating that single-run deltas up to ~0.4 can arise from model stochasticity and LLM-as-judge noise alone. The Condition B improvements for medqa-032 (+0.44) and medqa-029 (+0.44) sit at the edge of this variance envelope; medqa-036 (+0.67) exceeds it but rests on a single run per condition. We report all three for completeness but do not claim they demonstrate a recall effect.

The medqa-043 comparison is the cleanest signal. This case has a deep self-authored rule (a 55-line organ enlargement diagnostic protocol). Under Condition B, C-full misses the rule entirely — it retrieves episodes about unrelated cases (medqa-028, medqa-022, deceptive-01). The score regresses from 0.67 to 0.56: irrelevant context added noise without adding signal. Under Condition C, the correct rule is injected. The score improves from 0.67 to 0.78: the rule provides specific diagnostic procedures (mandatory infectious screening, anti-anchoring checklist) that address the exact failure mode of the baseline. The within-case delta (B=0.56 vs C=0.78, Δ=+0.22) controls for case difficulty and isolates the effect of retrieving the right content.

Interpretation. The data supports three conclusions, ordered by evidence strength:

  1. The retrieval analysis is the strongest evidence (deterministic, reproducible). Of 5 deep rules, C-full retrieves 1. This is not a sampling problem — it is a systematic mismatch between clinical-presentation queries and cognitive-rule embeddings. The query formulation failure (Section 7.5) is not a hypothetical risk but a measured 80% miss rate on the most valuable memory content.

  2. Content quality matters more than retrieval volume. The shallow episodes retrieved by C-full (score cards listing what was met/missed) contain minimal actionable content. The deep self-authored rules contain specific diagnostic procedures. The medqa-043 B/C comparison shows that retrieving the wrong content (3 irrelevant shallow episodes) performs worse than retrieving one correct deep rule.

  3. The behavioral data is directionally positive but noisy. Condition B shows 4/6 improvements, Condition C shows 2/3 improvements, but single-run scorer variance is too high for individual deltas to be reliable. A multi-run design (3-5 repetitions per case) would be needed to establish statistical significance.


9. Discussion

9.1 The Recall Problem is Not the Retrieval Problem

The central argument of this paper is that recall and retrieval are distinct problems requiring distinct solutions. The RAG literature has made enormous progress on retrieval — given a query, find relevant documents. Dense retrieval, sparse retrieval, hybrid approaches, re-ranking, and multi-hop reasoning have driven retrieval quality to impressive levels. But all of this assumes the query exists.

The recall problem sits upstream: it asks how the agent formulates the intention to query. In human cognition, this is handled by spreading activation (Collins & Loftus, 1975) — encountering a stimulus activates associated memories without deliberate effort. LLM agents have no persistent activation network between messages. Every recall must be either explicitly triggered by the agent or implicitly triggered by the system infrastructure. When both fail, the agent confabulates — not because she lacks the information, but because she does not know she has it.

Our empirical data (Section 8) quantifies this at three levels. At the infrastructure level, even when the pipeline works (PC=100%), semantic recall fails 60% of the time for agents with repetitive episode names (SRR=40% for clawent and dokter). The retrieval system finds something — but the wrong instance. At the behavioral level, the bootstrap case (Section 8.7) demonstrates the consequence: an agent answering a question about her own prior work gives a factually wrong answer without recall context — not because she lacks the ability to answer, but because the information is experiential (what she built that morning) and does not exist in her parametric knowledge. At the query formulation level, the natural A/B experiment (Section 8.8) reveals an 80% miss rate on the agent's most valuable memory content: of 5 deep self-authored cognitive rules, only 1 is retrievable via clinical-presentation queries. The pipeline has the knowledge; the query doesn't match it. This is the recall problem in its purest form across all three levels: the agent does not know she should ask, and when she does ask, the query may not reach the content that matters most.

9.2 Silent Degradation as a System Design Problem

Our pipeline failure (Section 5.2) reveals a design problem that extends beyond memory systems. Any multi-stage pipeline where the agent observes only the first stage is vulnerable to silent degradation. The agent sees the episode file on disk and concludes the memory pipeline is healthy. The indexing, embedding, and storage stages are invisible. This is analogous to the observability problem in distributed systems (Sridharan, 2018), but with a crucial difference: in distributed systems, engineers can add monitoring. In agent systems, the agent is the monitor — and she does not know what to monitor.

Designing for observability in agent memory systems means ensuring that the agent can verify end-to-end pipeline health, not just first-stage completion. Concretely: after writing an episode, the agent should be able to query for it and confirm retrieval. We have not yet implemented this verification step.
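A write-then-verify check of this kind might look like the sketch below. It is not the deployed system, which does not yet include this step: InMemoryStore is a hypothetical stand-in for the file-plus-vector-store pair, its word-overlap search is a toy substitute for semantic search, and the index_enabled flag simulates a silently broken indexer.

```python
class InMemoryStore:
    """Hypothetical stand-in for episode files plus a vector index."""

    def __init__(self, index_enabled: bool = True):
        self.files: dict[str, str] = {}
        self.index: dict[str, str] = {}
        self.index_enabled = index_enabled  # False mimics a silently failing indexer

    def write_episode(self, name: str, text: str) -> None:
        self.files[name] = text          # stage 1: always succeeds and is visible to the agent
        if self.index_enabled:
            self.index[name] = text      # later stages: invisible to the agent

    def search(self, query: str, top_k: int = 3) -> list[str]:
        # Toy ranking by shared words; a real system would use embedding similarity.
        words = set(query.lower().split())
        scored = [(name, len(words & set(text.lower().split())))
                  for name, text in self.index.items()]
        ranked = sorted((s for s in scored if s[1] > 0), key=lambda s: -s[1])
        return [name for name, _ in ranked[:top_k]]


def write_and_verify(store: InMemoryStore, name: str, text: str) -> bool:
    """Write an episode, then confirm it can be retrieved by its own content."""
    store.write_episode(name, text)
    return name in store.search(text)


healthy, broken = InMemoryStore(), InMemoryStore(index_enabled=False)
text = "automated cold-start recovery after reboot"
print(write_and_verify(healthy, "cold-start-bootstrap.md", text))
print(write_and_verify(broken, "cold-start-bootstrap.md", text))
```

In the broken store the file still exists after the write, so a check of the first stage alone would pass. Only the round trip through search exposes the failure.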

9.3 Ambient Context as Retrieval Cue

Our ambient context mechanism (C-light) is best understood through Tulving and Thomson's (1973) encoding specificity principle: a retrieval cue is effective to the extent it matches the encoding context. The episode filename encodes the topic at write time; encountering the filename at recall time provides a cue that overlaps with how the information was stored. The filename isess-dome-and-cold-start-bootstrap.md is not the memory — it is a cue whose semantic content overlaps with the encoding context, facilitating retrieval via the model's associative processing.

This cue-based approach has a concrete limitation: it depends on filename quality. A filename like 2026-03-15-session-notes.md provides no useful cue. Our system mitigates this through conventions: filenames must include a topic slug (e.g., cold-start-bootstrap, prove-confront-implementation). The informativeness of the filename is itself a design parameter that affects recall reliability.
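A convention of this kind can be enforced mechanically at write time. The sketch below is one possible heuristic, not the rule used in the deployed system: it discards purely numeric tokens (dates, counters) and a small list of generic words, then requires at least two informative tokens. The generic-word list and the threshold are assumptions chosen for illustration.

```python
import re

# Assumed list of words that carry no topical information in a filename.
GENERIC_WORDS = {"session", "notes", "log", "misc", "untitled", "episode"}


def has_topic_slug(filename: str, min_tokens: int = 2) -> bool:
    """Heuristic: does the filename contain enough informative tokens to act as a cue?"""
    stem = filename.rsplit(".", 1)[0]                      # drop the extension, if any
    tokens = [t for t in re.split(r"[-_]+", stem) if t]    # split on hyphens and underscores
    informative = [t for t in tokens
                   if not t.isdigit() and t.lower() not in GENERIC_WORDS]
    return len(informative) >= min_tokens


for name in ["2026-03-15-session-notes.md", "isess-dome-and-cold-start-bootstrap.md"]:
    print(name, has_topic_slug(name))
```

A check like this catches uninformative names but not repetitive ones: a numbered series such as proving-medqa-NNN would pass, so it complements the SRR measurement and does not replace it.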

9.4 Implications for Agent System Design

Based on our incident analysis and empirical measurement, we offer the following design recommendations:

  1. Assume recall will fail. Design systems where the consequences of recall failure are bounded. Ambient context is cheap insurance — ~80 tokens per message for 100% coverage of recent work.

  2. Make the full pipeline observable. If the agent writes an episode, she should be able to verify it is indexed and retrievable. Silent pipeline breaks are the most dangerous failure mode because they have no symptoms. Our measurement tool (prove_recall.py) serves as a template: a reproducible instrument that audits PC, SRR, and ATR across agents, making invisible failures visible.

  3. Do not rely on cognitive protocols for memory. "Remember to check your memory" is circular. Systemic mechanisms (hooks, ambient injection) are more reliable than behavioral instructions.

  4. Continuous retrieval is worth the cost. Park et al.'s (2023) approach of querying memory before every action is expensive but effective. For production agent systems, the cost of a missed recall exceeds the cost of redundant retrievals.

  5. Filename design matters. Our SRR data shows up to a 2.5× difference in recall reliability between agents with distinctive episode names (90-100% SRR) and those with repetitive patterns (40% SRR). Episode filenames are not just filesystem conventions; they are retrieval cues (Tulving & Thomson, 1973), and their informativeness is a measurable design parameter.

  6. Layer defense by time horizon. C-light covers recent work (high ATR for low episode counts). C-full covers the full store (high SRR for distinctive names). Neither alone is sufficient. Design memory systems with explicit awareness of which defense layer covers which time horizon.


10. Limitations

This is a systems experience report from a single production system with a specific architecture (Claude Code sessions, markdown episodes, ChromaDB, Ollama embeddings). Our findings may not generalize to systems with different memory architectures, context window sizes, or agent interaction patterns.

Our qualitative analysis rests on a single observed failure (N=1), analyzed in depth. The quantitative validation (Section 8) extends the evidence base to 12 agents and 290 episodes for infrastructure metrics, and two levels of behavioral evidence. The taxonomy (Section 7) is preliminary — four modes observed (three from incident analysis, one from metric measurement), two hypothesized. The clinical cases show recall has marginal impact when parametric knowledge suffices — a finding that may not generalize to domains where the model lacks strong training data.

The behavioral experiments (Sections 8.7 and 8.8) use a simplified protocol: episode content injected directly into the prompt, simulating what the recall hook would provide. This tests the value of recall context but not the full pipeline's ability to deliver it. The multi-run experiment (Section 8.7, 30 runs: 3 cases × 2 conditions × 5 repetitions) achieves statistical significance for one case — medqa-043's CI [0.149, 0.340] excludes zero — demonstrating that recall of a procedural diagnostic protocol measurably improves clinical assessment. The bootstrap case shows an unambiguous effect (C never fails, A fails 40%) but extreme Condition A variance (0.00 to 1.00) prevents the CI from excluding zero. The sample of 3 cases is small; the significance claim applies to medqa-043 specifically, not to recall in general. The natural A/B comparison (Section 8.8) is single-run and noisy — its contribution is the retrieval analysis (1/5 deep rules retrievable via clinical queries), not the behavioral scores. A live A/B test — disabling recall for a running agent during real work — remains the most important next step.

Our system uses Claude (Anthropic) as the underlying LLM. The recall problem may manifest differently with other models. Additionally, our system runs on proprietary infrastructure (Claude Code with custom Python hooks) that cannot be directly replicated. We focus the contribution on the conceptual framework — the recall/retrieval distinction, the failure taxonomy, and the retrieval-cue mechanism — which are architecture-independent and could be implemented on any LLM runtime with episodic memory. The measurement instrument (prove_recall.py) uses standard tools (filesystem enumeration, ChromaDB queries, cosine similarity) and could be adapted to other vector-store-backed memory systems.


11. Future Work

  1. Live behavioral A/B testing: Our multi-run experiment (Section 8.7) demonstrates recall's value with statistical significance for one case (medqa-043) and strong directional evidence for another (bootstrap). The next step is a live test where C-light and C-full are disabled for a running agent during real work sessions, measuring contradiction rate over a week against a control agent with both enabled. This tests not just the value of recall context but the full pipeline's reliability in production, with real tasks replacing synthetic cases.

  2. Longitudinal tracking: Run prove_recall.py weekly to measure PC, SRR, and ATR trends as memory accumulates. Detect whether SRR degrades with memory size (hypothesized for repetitive naming patterns) or remains stable.

  3. Naming convention intervention: Our SRR data suggests a concrete experiment. Enforce distinctive episode names for clawent and dokter (currently at 40% SRR), re-measure after 20 episodes. If SRR rises to 90%+, naming conventions become a validated design parameter.

  4. End-to-end pipeline verification: After episode writing, automatically verify that the episode is queryable via the semantic search system. Alert on verification failure. The bimodal PC distribution (100% or 0%) suggests a simple health check would catch all failures.

  5. Adaptive ambient context: Instead of fixed recent filenames, use lightweight relevance scoring to select which filenames to surface based on the current conversation topic. This would break the inverse-proportionality between ATR and memory depth.

  6. Cross-agent recall: Extend ambient context to include relevant episodes from other agents' memories when the current agent's task overlaps with another's domain.

  7. Query formulation bridging: The 80% deep-rule miss rate (Section 8.8) arises because clinical presentation text and cognitive rule text occupy different embedding spaces. Potential mitigations include: dual-query retrieval (one query from the prompt, one from a "what rules apply?" meta-query), multi-modal indexing (index rules both by their clinical trigger and their cognitive content), or a lightweight classifier that maps case presentations to rule categories before querying.

  8. Recall as a measurable substrate: The memory infrastructure — indexing parameters, similarity thresholds, chunk sizes, filename conventions — is code with measurable outputs. It can be optimized through closed-loop experimentation (keep/discard based on recall metrics), applying the same methodology we have used for model training optimization.
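As an illustration of the dual-query idea in item 7, the sketch below merges the results of two queries, one built from the prompt and one from a meta-query, keeping the best similarity seen for each episode. It is a sketch under stated assumptions and not an implemented mitigation: the search function, the canned results, the filenames, and the meta-query wording are all hypothetical, and similarity scores from different queries are treated as directly comparable, which a real system would need to validate.

```python
from typing import Callable

# A search function maps a query string to ranked (episode_id, similarity) pairs.
SearchFn = Callable[[str], list[tuple[str, float]]]


def dual_query(prompt: str, meta_query: str, search: SearchFn,
               top_k: int = 3) -> list[tuple[str, float]]:
    """Merge results of two queries, keeping each episode's best similarity."""
    best: dict[str, float] = {}
    for query in (prompt, meta_query):
        for episode_id, sim in search(query):
            best[episode_id] = max(sim, best.get(episode_id, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical canned results: the prompt surfaces only shallow episodes,
# while the meta-query reaches the rule text.
CANNED = {
    "33-year-old female with knee pain": [
        ("shallow-episode-a.md", 0.60),
        ("shallow-episode-b.md", 0.55),
        ("shallow-episode-c.md", 0.50),
    ],
    "which diagnostic rules apply to joint pain?": [
        ("deep-rule-joint-exam.md", 0.70),
        ("shallow-episode-a.md", 0.40),
    ],
}


def toy_search(query: str) -> list[tuple[str, float]]:
    return CANNED.get(query, [])


merged = dual_query("33-year-old female with knee pain",
                    "which diagnostic rules apply to joint pain?", toy_search)
for episode_id, sim in merged:
    print(f"{episode_id}: {sim:.2f}")
```

The design question this raises is where the meta-query comes from. Generating it requires the system to guess that rules might apply, which moves part of the recall problem into query construction instead of eliminating it.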


References

Anderson, J. R. (2007). How Can the Human Mind Occur in the Physical Universe? Oxford University Press.

Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. ICLR 2024. arXiv:2310.11511.

Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407-428.

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2024). Retrieval-augmented generation for large language models: A survey. arXiv:2312.10997.

Hu, Y., Liu, S., Yue, Y., Zhang, G., Liu, B., et al. (2025). Memory in the age of AI agents: A survey. arXiv:2512.13564.

Laird, J. E. (2012). The Soar Cognitive Architecture. MIT Press.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS 2020. arXiv:2005.11401.

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157-173.

Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., & Hajishirzi, H. (2023). When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. ACL 2023. arXiv:2212.10511.

Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.

Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. UIST '23. arXiv:2304.03442.

Pink, M., Wu, Q., Vo, V. A., Turek, J., Mu, J., Huth, A., & Toneva, M. (2025). Position: Episodic memory is the missing piece for long-term LLM agents. arXiv:2502.06975.

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS 2023. arXiv:2303.11366.

Sridharan, C. (2018). Distributed Systems Observability. O'Reilly Media.

Tulving, E. (1972). Episodic and semantic memory. In E. Tulving & W. Donaldson (Eds.), Organization of Memory (pp. 381-403). Academic Press.

Tulving, E. (1983). Elements of Episodic Memory. Oxford University Press.

Tulving, E. (2002). Episodic memory: From mind to brain. Annual Review of Psychology, 53, 1-25.

Tulving, E., & Thomson, D. M. (1973). Encoding specificity and retrieval processes in episodic memory. Psychological Review, 80(5), 352-373.

Xu, W., Liang, Z., Mei, K., Gao, H., Tan, J., & Zhang, Y. (2025). A-MEM: Agentic memory for LLM agents. NeurIPS 2025. arXiv:2502.12110.

Zhang, Z., Dai, Q., Bo, X., Ma, C., Li, R., Chen, X., Zhu, J., Dong, Z., & Wen, J.-R. (2024). A survey on the memory mechanism of large language model based agents. ACM Transactions on Information Systems, 43(6). arXiv:2404.13501.