perea.ai Research · 1.0 · Public draft

Agent Memory in Production

Vector vs Graph vs Episodic — the third infrastructure layer after MCP and observability

Author: Dante Perea
Published: 7 May 2026 02:28
Length: 6,697 words · 30 min read
Audience: Platform engineers, infra leads, and applied-AI teams shipping multi-session agents that need to remember users, projects, and tool experiences across days, weeks, and months
License: CC BY 4.0

#Foreword

The first three infrastructure layers of the agent economy are no longer ambiguous. The first is tool access — solved by the Model Context Protocol and the integration playbook documented in the MCP Server Playbook. The second is runtime observability — solved by the OpenTelemetry GenAI conventions, ClickHouse-Langfuse, LangSmith, Phoenix, and the seven-stage eval pipeline laid out in the Agent Observability Stack. The third, and the subject of this paper, is memory.

Memory is what makes an agent feel coherent across sessions. Without it, an agent asks the user the same question twice on Tuesday and Thursday, forgets the customer's preferences from last quarter, and re-derives the same plan after every restart. With it, the agent improves over weeks, recalls user-confirmed corrections months later, and accumulates organizational knowledge that survives team turnover. The 2025-2026 landscape did to memory what 2024-2025 did to tool calling: it consolidated four distinct approaches into a small number of named patterns, four serious frameworks, and a benchmark suite that lets practitioners compare vendors honestly. This paper documents that consolidation, the failure modes that have emerged alongside it, and the production playbook that has converged across the teams shipping at scale.

This paper sits alongside the rest of the perea.ai canon. Read it after the MCP Server Playbook and the Agent Observability Stack, before the Indirect Prompt Injection Defense paper. Memory is one of the surfaces that injection attacks target, and the threat model documented in the security paper applies here too — but the architectural questions are different enough to deserve their own treatment.

#Executive Summary

  1. Four memory types stabilized in 2026. Working memory (context window), episodic (event log with dual time + embedding indices), semantic (consolidated facts in a knowledge graph or structured store), and procedural (system prompt plus tool definitions plus a skill library). The taxonomy is borrowed from cognitive science but earns its keep operationally — each type has a different storage backend, a different access pattern, and a different decay policy. The 2026 survey paper at arXiv 2603.07670 codifies the model; production architectures from Mem0, Zep, Letta, and LangMem all map onto it.

  2. Four serious frameworks emerged. Mem0 (arXiv 2504.19413, 48K GitHub stars, $24M raised) targets token-efficient retrieval at under 7,000 tokens per call versus 25,000+ for full-context approaches. Zep with its underlying Graphiti library (arXiv 2501.13956) builds a temporal knowledge graph that handles enterprise-relevant cross-session reasoning and changes-over-time. Letta is the production-grade evolution of UC Berkeley's MemGPT — a stateful runtime with PostgreSQL-backed persistence, an Agent Development Environment, and Letta Cloud for managed hosting. LangMem layers on top of LangGraph stores with hot-path memory tools and a background memory manager. All four are open-source or have open cores; all four are credible production choices.

  3. Vector database choice is settled for 2026. Qdrant is the production default for 2 to 49 concurrent agents (26-29ms p99 with MVCC under 10-agent concurrent load, $45-96/month for 10M vectors). Weaviate's native multi-tenancy makes it the right choice above 50 agents and the right choice for tool memory registries that need hybrid BM25 + dense single-query retrieval at 44ms p99. Pinecone earns its place when the operational simplicity of zero-ops is worth the 50-100% premium over self-hosted alternatives. pgvector is the under-appreciated default for teams already on Postgres at under 10M vectors. Chroma stays in the prototype tier — it fails reliably above 3 simultaneous agents.

  4. Benchmark scores converged. Mem0's April 2026 algorithm hits 91.6 on LoCoMo, 93.4 on LongMemEval, and 64.1 on BEAM at 1M tokens — at average 6,800 tokens per query. Zep hits 94.8% on the Deep Memory Retrieval benchmark with GPT-4-turbo (98.2% with GPT-4o-mini) and an 18.5% improvement on LongMemEval, with 90% latency reduction. The remaining frontier — visible in Mem0's BEAM-10M scores at 48.6 — is multi-session reasoning, event ordering, and contradiction resolution at production-volume context windows.

  5. Memory is now an attack surface. Agent Security Bench reported memory poisoning at just 7.92% attack success rate in single-shot evaluations — the lowest of the major attack classes. That number is misleading. InjecMEM (single-interaction targeted poisoning, indirect tool-side path), the Zombie Agent line of research (two-phase infection-trigger persisting through truncation, summarization, and retrieval ranking), and MemoryGraft (semantic-imitation-heuristic exploitation validated against MetaGPT's DataInterpreter on GPT-4o) all demonstrate persistent compromise. Once a poisoned record is in long-term memory, common eviction mechanisms do not remove it. The production memory layer is now a primary security boundary.

  6. The non-adversarial failure mode is drift. Memory useful today is stale tomorrow. Reconciliation against a source-of-truth is not optional in production — it is the loop that keeps the agent honest. The 2026 emerging discipline is versioned reads with watermarks, lazy revalidation at read time, and conflict-resolution evaluation as a first-class CI competency (now part of the ICLR 2026 memory benchmark suite). Most production agents fail world-state perturbation tests in three of four quadrants — they retrieve stale memories and use them confidently rather than flagging freshness or re-reading the source.

  7. What this paper does not cover. Static-document RAG (a retrieval problem with its own literature), fine-tuning versus memory as alternatives for personalization (a separate research line), federated and on-device memory (orthogonal infrastructure), and direct prompt-injection through user input (covered in the Indirect Prompt Injection Defense paper). The focus here is the multi-session memory layer that sits behind a production agent and answers the question "what do you remember about this user, this project, this tool, and this past task?"

The rest of the paper is the operational walkthrough. Part I covers the four memory types. Parts II and III cover frameworks and vector databases. Part IV covers failure modes. Part V covers reference architecture patterns. Part VI is the 90-day implementation playbook for one agent. Part VII is the 2027 horizon.

#Part I: The Four Memory Types

The 2026 survey paper at arXiv 2603.07670 organized the field around four memory types. The taxonomy is borrowed from cognitive science — Baddeley's working-memory model, Tulving's distinction between episodic and semantic memory, the procedural-memory tradition — but it earns its place operationally. Each type has a different storage backend, a different access pattern, and a different lifecycle.

Working memory is the context window. It is always present, it has a hard cutoff at the token limit, and the agent reasons over it directly without retrieval. The 1M-token context windows shipped by Claude Opus 4.7 in April 2026 and GPT-5.5 later that month did not eliminate the need for the other three memory types, but they did change the working-memory budget from a binding constraint to a comfortable resource. Working memory holds the active turn — the user's current message, retrieved tool results, the agent's in-flight reasoning. Everything else is somewhere else.

Episodic memory is the record of concrete experiences. Park and colleagues' Generative Agents work — the simulated town where Isabella saw Klaus painting in the park at 3pm — set the operational pattern: every observation lands in an event stream with a timestamp, an importance (or salience) score, and an embedding for later retrieval. Production episodic stores use dual indexing — a time-based index for recency queries ("what happened in the last week?") and an embedding index for semantic queries ("what are similar past situations?"). Salience scoring is the third critical primitive: a casual greeting deserves a different lifecycle than a corrective instruction or a stated preference. The Voyager skill library hints at the procedural extension — every verified Minecraft routine is stored with a natural-language description and indexed for retrieval — but Voyager's primitives belong to procedural memory rather than episodic, even though they are stored similarly.
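
Operationally, the episodic primitive reduces to one record shape and two query paths. A minimal sketch, with illustrative names and an in-memory list standing in for a real time-indexed and vector-indexed backend:

```python
# Minimal sketch of the episodic primitive: one event record, two query paths.
# Names (EpisodicEvent, EpisodicStore) are illustrative, not any framework's API.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class EpisodicEvent:
    timestamp: datetime        # time index: recency queries
    text: str                  # the raw observation
    salience: float            # 0.0 (casual greeting) .. 1.0 (correction, stated preference)
    embedding: list[float]     # embedding index: semantic queries
    source: str = "user"       # provenance: "user", "tool", or "agent_inference"

class EpisodicStore:
    def __init__(self) -> None:
        self.events: list[EpisodicEvent] = []

    def append(self, event: EpisodicEvent) -> None:
        self.events.append(event)

    def recent(self, window: timedelta) -> list[EpisodicEvent]:
        """Time-index query: 'what happened in the last week?'"""
        cutoff = datetime.utcnow() - window
        return [e for e in self.events if e.timestamp >= cutoff]

    def similar(self, query_embedding: list[float], k: int = 5) -> list[EpisodicEvent]:
        """Embedding-index query: 'what are similar past situations?'"""
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(y * y for y in b) ** 0.5
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.events, key=lambda e: cosine(e.embedding, query_embedding), reverse=True)
        return ranked[:k]
```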

Semantic memory is consolidated, decontextualized knowledge. The episodic record "the user corrected the date format on January 5, January 12, and February 1" consolidates into the semantic fact "user prefers DD/MM/YYYY." The consolidation is rarely automatic — most production systems require explicit prompting, heuristic triggers, or a periodic LLM-driven summarization pass. The consolidation step is "particularly underserved" in current systems, per the 2026 survey: it is the layer that turns experience into rules, and most teams ship with crude heuristics that drop or duplicate facts at unpredictable intervals. Semantic memory needs upsert semantics — a new fact about the user's job replaces the old one rather than coexisting with it.
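
The operational core of semantic memory is the upsert. A minimal sketch, keyed on (subject, attribute) so that a new fact replaces the old one while the superseded version stays queryable for audit; the dict-backed store is illustrative, not any framework's API:

```python
# Sketch of upsert semantics for semantic memory: a new fact about the same
# (subject, attribute) pair replaces the old one instead of coexisting with it.
from datetime import datetime

class SemanticStore:
    def __init__(self) -> None:
        self.facts: dict[tuple[str, str], dict] = {}   # (subject, attribute) -> current fact
        self.history: list[dict] = []                  # superseded facts, kept for audit

    def upsert(self, subject: str, attribute: str, value: str, source: str) -> None:
        key = (subject, attribute)
        if key in self.facts and self.facts[key]["value"] != value:
            self.history.append(self.facts[key])       # keep the prior version queryable
        self.facts[key] = {
            "subject": subject,
            "attribute": attribute,
            "value": value,
            "source": source,
            "updated_at": datetime.utcnow().isoformat(),
        }

store = SemanticStore()
# Three episodic corrections consolidate into one semantic fact:
store.upsert("user-123", "date_format", "DD/MM/YYYY", source="consolidation_pass")
# A later change replaces the stale fact rather than duplicating it:
store.upsert("user-123", "date_format", "ISO 8601", source="user_statement")
```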

Procedural memory is reusable skills and executable plans. The system prompt is procedural memory by another name. So is every tool definition. So is the skill library that some frameworks (Voyager and its descendants) maintain alongside the LLM. Procedural memory is always present in context — it does not need to be retrieved — and it is version-controlled like code rather than aged out like episodes. JARVIS-1 extends the hierarchical principle to multimodal contexts with separate stores for visual observations, textual plans, and executable routines.

The aspiration is the four-layer integrated stack: an agent handling a customer-support return request would draw on the procedural-memory script for processing returns, the semantic-memory fact that "customers with damaged items within 7 days are eligible for express replacement," the episodic-memory record of the specific customer's prior interactions, and the working-memory active context of the current request. The reality, per the 2026 survey, is that most current systems implement only two layers well — typically working plus episodic, or working plus semantic — and handle transitions between layers with crude heuristics. The next two years will be the consolidation work, in both senses.

#Part II: Frameworks That Ship

Four frameworks dominate the 2026 production memory market. They differ in what they optimize, but all four are credible, all four are open-source or have open cores, and all four have enterprise customers in production.

Mem0 is the token-efficiency leader. The April 2026 algorithm — documented in the company's blog post and the corresponding documentation page — achieves 91.6 on the LoCoMo benchmark (up from 71.4 on the prior algorithm), 93.4 on LongMemEval (up from 67.8), 64.1 on BEAM at 1M tokens, and 48.6 on BEAM at 10M tokens. The headline operational figure is 6,719 to 6,956 average tokens per retrieval call versus 25,000+ for full-context approaches, with p50 latencies of 0.88s on LoCoMo, 1.09s on LongMemEval, and 1.00-1.05s on BEAM. The arXiv 2504.19413 paper introduces both the base Mem0 (vector + salient-information extraction + consolidation + retrieval) and Mem0^g, the graph-augmented variant that adds 2 percentage points to overall accuracy. Mem0 ships as both a managed platform and an Apache-2.0 open-source SDK at github.com/mem0ai/mem0 (48,000 stars at time of writing). The single-pass retrieval — one call, no agentic loops — is what makes the latency profile work in production.
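
For orientation, a sketch of the open-source SDK's write and read paths. The method names follow the mem0ai/mem0 README cited in the references; exact constructor options and return shapes vary by SDK version, and the defaults assume an OpenAI API key is configured.

```python
# Hedged sketch of the open-source Mem0 SDK; signatures may differ by version.
from mem0 import Memory

m = Memory()  # default config: local vector store plus an OpenAI-compatible LLM

# Write path: Mem0 extracts salient facts from the exchange and consolidates them
# against existing memories for this user.
m.add(
    [{"role": "user", "content": "I prefer DD/MM/YYYY dates and TypeScript examples."}],
    user_id="user-123",
)

# Read path: single-pass retrieval, no agentic loop. This is the call whose token
# footprint stays in the ~7K range quoted above.
results = m.search("how should dates be formatted for this user?", user_id="user-123")
print(results)  # a small set of extracted memories; exact shape varies by SDK version
```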

Zep, built on the open-source Graphiti library, is the temporal-knowledge-graph leader. The arXiv 2501.13956 paper reports 94.8% on Deep Memory Retrieval with GPT-4-turbo (98.2% with GPT-4o-mini), an 18.5% accuracy improvement on LongMemEval, and a 90% reduction in response latency compared to baseline implementations. Graphiti synthesizes both unstructured conversational data and structured business data into a temporally-aware knowledge graph. The crucial architectural choice — and Zep's differentiator from GraphRAG — is temporal awareness. As facts change or are superseded, the graph is updated to reflect the new state, and prior versions remain queryable. Queries fuse time, full-text, semantic, and graph-algorithm approaches in a single call. For an enterprise agent that needs to reason about "what did the customer want in Q1 versus what they want now?" the temporal graph is qualitatively different from a vector store of timestamped chunks.

Letta is the stateful-runtime leader. It is the production-grade evolution of UC Berkeley's MemGPT research, which introduced virtual context management — paging between context-window "main memory" and an external store, modeled on operating-system hierarchical-memory designs. Letta operationalizes the research as a server with a REST API, TypeScript and Python clients, PostgreSQL-backed persistence, a web-based Agent Development Environment at app.letta.com, and a managed Letta Cloud service. The architectural commitment is that the server owns all state — the application is stateless, the Letta server is stateful, and everything (memories, messages, reasoning traces, tool calls) is persisted to a PostgreSQL database that survives restarts. Memory blocks are bounded (5,000 characters by default) and live inside the agent's system prompt; the agent can edit them via memory tools. For data that does not fit, archival memory exposes archival_memory_insert and archival_memory_search tools backed by self-hosted pgvector or Letta's managed index. Multi-agent shared memory is a first-class primitive: a memory block can be attached to multiple agents and is simultaneously visible to all of them. Letta is Apache-2.0; the horizontal-scaling pattern is multiple Letta instances behind a load balancer pointing at the same PostgreSQL database.

LangMem is the LangChain-native option, layered on top of LangGraph stores. It exposes three modes: a core memory API that works with any storage system; productized memory tools (create_manage_memory_tool and create_search_memory_tool) that agents can use during conversations on the hot path; and a background memory manager that automatically extracts, consolidates, and updates agent knowledge between turns. Memories are stored as JSON documents organized by namespace and key. The namespace patterns generalize the multi-tenant case — ("memories", "user-123") for personal memories, ("memories", "team-product") for shared team knowledge, ("memories", "project-x") for project-specific memories. Multi-agent shared-storage patterns drop out naturally: agent A writes to ("memories", "team_a", "agent_a") and reads from ("memories", "team_a"); agent B writes to its own subnamespace and reads from the same shared team space. LangMem also exposes an episodic-memory primitive with an Episode schema — observation plus thoughts plus action plus result — for experience replay. For production, AsyncPostgresStore replaces the development-only InMemoryStore.
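
A sketch of the hot-path configuration described above. The tool factories, store, and namespace templating follow the LangMem and LangGraph documentation cited in the references; the model string, embedding settings, and agent wiring are illustrative and assume an OpenAI key is available.

```python
# Sketch of LangMem hot-path memory tools on a LangGraph prebuilt agent.
from langgraph.prebuilt import create_react_agent
from langgraph.store.memory import InMemoryStore   # development only; AsyncPostgresStore in production
from langmem import create_manage_memory_tool, create_search_memory_tool

store = InMemoryStore(index={"dims": 1536, "embed": "openai:text-embedding-3-small"})

agent = create_react_agent(
    "openai:gpt-4o",
    tools=[
        # Writes land in the per-user namespace; {user_id} is filled from config at runtime.
        create_manage_memory_tool(namespace=("memories", "{user_id}")),
        create_search_memory_tool(namespace=("memories", "{user_id}")),
    ],
    store=store,
)

agent.invoke(
    {"messages": [{"role": "user", "content": "Remember that I prefer DD/MM/YYYY."}]},
    config={"configurable": {"user_id": "user-123"}},
)
```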

The native chatbot memories from OpenAI and Anthropic are the fifth option, not a substitute for the four above. OpenAI's ChatGPT memory, rolled out across April through June 2025 to Plus, Pro, Team, and Enterprise users and then to free users, has two modes — saved memories (direct user instructions to remember) and chat history (insights ChatGPT gathers from past chats). Anthropic's Claude memory rolled out to Team and Enterprise in September 2025, to Pro and Max in October 2025, and to free users in March 2026. The Verge's reporting documented two distinguishing features in Claude's design: complete transparency (users see exactly what Claude remembers, not vague summaries) and distinct memory spaces (separate memories for different work projects vs personal use). These native systems work for individual chat use cases. They are not a substitute for an embedded memory layer when you are building agent infrastructure — they don't expose a programmable API, they don't let you set governance policies, and they don't integrate with your vector database or your observability stack.

When to use which. Pick Mem0 when token cost dominates your unit economics. Pick Zep / Graphiti when temporal reasoning across structured and unstructured business data is the central use case. Pick Letta when you want a managed stateful runtime, your team prefers configuration over code, or you want to use the MemGPT-style virtual-context-management pattern with minimal engineering investment. Pick LangMem when you are already on LangChain and LangGraph and want first-party integration. Use OpenAI or Anthropic native memory when you are building a chatbot product and the memory is part of the consumer experience rather than your infrastructure.

#Part III: The Vector Database Decision

The vector-database market in 2026 is no longer a research debate. Four production benchmarks across hundreds of enterprise deployments converged on the same shape of recommendation; the remaining differences are about where the boundaries between regimes fall.

Qdrant is the production default. Across the RankSquire multi-agent benchmark (March 2026), the AICited Technical Journal (200+ enterprise clients, 50M+ queries per month), Leaper.dev's four-month production test at 10M embeddings, and Inventiple's April 2026 honest comparison, Qdrant either led or tied for lead on pure-vector latency, filtered-query latency, and cost. Representative numbers at 10M vectors: 22-43ms p99 pure-vector queries, 47-55ms filtered queries, 26-29ms p99 under 10-agent concurrent load with MVCC. Qdrant Cloud pricing at $30-45 per month for production-grade workloads is the cheapest of the major managed vector databases. The architectural reasons are plain. Rust core, single-binary self-hostable, MVCC concurrency model, native sparse-vector support for hybrid retrieval, and quantization options (binary, scalar, product) that give real memory savings without destroying recall. The headline production number — 26-29ms p99 under 10-agent concurrent load with no degradation — is what makes Qdrant safe for the multi-agent production case.
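
The workhorse multi-agent pattern on Qdrant is payload-filtered vector search: one collection, with tenant and memory-type filters applied per query. A sketch using the qdrant-client Python SDK; the collection name and payload fields are illustrative, and the zero vectors stand in for real embeddings.

```python
# Sketch of namespace-scoped retrieval on Qdrant via payload filters.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="agent_memory",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

client.upsert(
    collection_name="agent_memory",
    points=[PointStruct(
        id=1,
        vector=[0.0] * 1536,   # placeholder; use a real embedding
        payload={"user_id": "user-123", "type": "episodic", "salience": 0.8},
    )],
)

# Filtered read: semantic similarity scoped to one tenant's namespace in a single call.
hits = client.search(
    collection_name="agent_memory",
    query_vector=[0.0] * 1536,
    query_filter=Filter(must=[FieldCondition(key="user_id", match=MatchValue(value="user-123"))]),
    limit=5,
)
```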

Weaviate is the right choice above 50 simultaneous agents and the right choice when the workload is a tool memory registry that requires hybrid BM25 plus dense single-query retrieval. Weaviate's native multi-tenancy lets hundreds of agent namespaces coexist within a single collection without collection proliferation, and Weaviate's hybrid BM25 + dense single query at 44ms p99 covers exact function-name matching plus semantic intent matching in one round trip — the production-standard tool memory implementation when the function registry exceeds 50 entries. The production-standard hybrid stack for 2026 large multi-agent deployments is Qdrant for the L1 working and L2 semantic memory layers and Weaviate for the tool memory registry layer.
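
A sketch of the tool-registry read path on Weaviate, assuming a collection with multi-tenancy enabled and a server-side vectorizer configured; the collection name, tenant id, and properties are illustrative.

```python
# Sketch of a hybrid BM25 + dense query against a per-agent tenant in Weaviate.
import weaviate

client = weaviate.connect_to_local()
try:
    tools = client.collections.get("ToolMemory")      # assumes multi-tenancy is enabled
    registry = tools.with_tenant("agent-42")           # scope all reads to this agent's tenant

    # One round trip covers exact function-name matching (BM25) and intent matching (dense).
    response = registry.query.hybrid(query="create a calendar event for tomorrow", alpha=0.5, limit=5)
    for obj in response.objects:
        print(obj.properties)                           # e.g. tool name, signature, past outcome
finally:
    client.close()
```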

Pinecone earns its place when operational simplicity is worth the cost premium. Cold-start latency on Pinecone Serverless of 200-800ms per index is acceptable for an L3 episodic log accessed once per session boundary; it is not acceptable as the primary read/write store for a real-time agent. Where Pinecone shines is the zero-ops surface: no infrastructure to maintain, no version upgrades, no scaling decisions. The 50-100% cost premium over self-hosted Qdrant or Weaviate is the price of that simplicity, and for many enterprise deployments it is rational.

pgvector is the under-appreciated default for teams already on Postgres at under 10 million vectors. The transactional consistency between vectors and relational data — no synchronization pipeline, no two-stage commit, no eventual consistency to reason about — is a real engineering advantage. HNSW indexes in pgvector 0.7+ are fast enough for most workloads under 10 million vectors. The decision tree is: if you are already on Postgres and below 10M vectors, default to pgvector and migrate later if you outgrow it. Most early-stage SaaS products will not outgrow it.
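
A sketch of the pgvector pattern for a Postgres-native team: schema, HNSW index, and a tenant-scoped nearest-neighbour read in one transactional store. The table and column names are illustrative; the snippet assumes the vector extension is installed and uses psycopg 3 with the pgvector adapter.

```python
# Sketch of pgvector as the memory store: relational data and vectors share one transaction.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

with psycopg.connect("dbname=agent_memory") as conn:
    register_vector(conn)   # teach psycopg how to send/receive the vector type
    conn.execute("""
        CREATE TABLE IF NOT EXISTS memories (
            id bigserial PRIMARY KEY,
            user_id text NOT NULL,
            body text NOT NULL,
            embedding vector(1536)
        )
    """)
    # HNSW index (pgvector 0.7+) keeps query latency acceptable below ~10M vectors.
    conn.execute("""
        CREATE INDEX IF NOT EXISTS memories_embedding_idx
        ON memories USING hnsw (embedding vector_cosine_ops)
    """)
    # Tenant filter and nearest-neighbour search in a single SQL statement.
    query_embedding = np.zeros(1536)   # placeholder; use a real embedding
    rows = conn.execute(
        "SELECT body FROM memories WHERE user_id = %s ORDER BY embedding <=> %s LIMIT 5",
        ("user-123", query_embedding),
    ).fetchall()
```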

Chroma is for prototypes only. Its file-based locking mechanism blocks concurrent writes and produces data corruption under crash scenarios; it fails reliably above 3 simultaneous agents. Chroma's role is the local development environment and the "let's see if a vector store helps at all" prototyping phase. When the prototype is validated and the agent is moving toward production, the migration target is Qdrant.

The 2026 sizing rules of thumb: under 1M vectors → pgvector if on Postgres, otherwise Pinecone for time-to-market or Qdrant for performance-cost. 1M-10M vectors → Qdrant or Weaviate self-hosted; Pinecone if zero-ops outweighs cost. Above 10M vectors → Qdrant for latency, Milvus if highest indexing throughput is critical (Milvus hits 18,000 vectors per second indexing; Qdrant 15,000; Pinecone 1,389). Avoid pgvector at this scale — its query latency degrades to 156ms p95 at 10M vectors, three to four times slower than dedicated vector databases.

#Part IV: Failure Modes

The agent-memory failure modes in 2026 split into two categories: adversarial (memory as an attack surface) and non-adversarial (drift, contamination, eviction failures). Both are now first-class production concerns.

Adversarial: persistent memory poisoning. Agent Security Bench reported memory poisoning at 7.92% attack success rate in single-shot evaluations — the lowest of the major attack classes. That headline number is misleading because it measures only single-shot. The line of research that began with InjecMEM (single-interaction targeted memory injection requiring no read or edit access to the memory store) and continued through the Zombie Agent paper (a black-box two-phase infection-trigger framework demonstrating persistent compromise) and MemoryGraft (a semantic-imitation-heuristic exploit validated against MetaGPT's DataInterpreter on GPT-4o) collectively demonstrate that once a poisoned record is in long-term memory, common eviction mechanisms do not remove it. Truncation does not remove it. Summarization does not remove it. Retrieval ranking does not remove it. The Zombie Agent paper put the verdict directly: "common memory mechanisms do not remove malicious instructions once they enter memory." The agent's standard memory-update protocol — the function that decides "this turn produced something worth remembering, save it" — is the entry point for the attack.

The InjecMEM construction is instructive for production teams. The attacker crafts a two-part prompt: a topic-anchoring segment that directs the write into the desired topic so that later same-topic queries retrieve the poisoned record; and an adversarial command optimized to steer the LLM to a specified target output whenever the poisoned record appears in the final input. The attacker needs only one interaction with the agent and no read or edit access to the memory store. The indirect path is worse: a compromised tool emits the attack prompt as part of its output, the memory layer logs it as part of normal operation, and subsequent topic queries retrieve the poisoned record long after the compromised tool has been repaired. The attack persists across sessions until memory is purged.

The MemoryGraft attack adds a subtler variant. Instead of inserting prompts or triggers into the current context, the attacker plants entries that masquerade as legitimate successful experiences. The agent reads benign-looking ingestion artifacts the attacker controls (a README, a documentation page), constructs RAG entries that look like prior successes, and persists them alongside genuine experiences. When the agent later encounters a semantically similar task, retrieval surfaces the grafted memory, and the agent — biased by the semantic-imitation heuristic that drives experience replay — adopts the embedded unsafe pattern. The attack is trigger-free; behavioral drift across sessions is the only signal, and most production observability systems do not catch it.

The defenses come from the architectural-defense literature documented in the Indirect Prompt Injection Defense paper. Provenance tracking, capability-based access control, and the FIDES-style information-flow-control approach apply directly. The memory layer should label every entry with its source — direct user input, tool output, agent inference — and refuse to surface untrusted-source entries to high-stakes operations. Periodic memory audits and canary entries are the runtime equivalent of intrusion detection.
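
A minimal sketch of what source labeling and trust gating look like at retrieval time; the trust ordering, the high-stakes policy, and the canary mechanism are illustrative choices, not any framework's API.

```python
# Sketch of provenance-labelled memory records and a trust gate at read time.
from dataclasses import dataclass

TRUST = {"user_direct": 3, "agent_inference": 2, "tool_output": 1}

@dataclass
class MemoryRecord:
    text: str
    source: str            # "user_direct" | "agent_inference" | "tool_output"
    canary: bool = False   # planted entries that should never surface in retrieval

def surface(records: list[MemoryRecord], high_stakes: bool) -> list[MemoryRecord]:
    allowed = []
    for r in records:
        if r.canary:
            # A surfaced canary means the retrieval path is returning things it should not.
            raise RuntimeError("canary memory surfaced: audit the memory store for poisoning")
        if high_stakes and TRUST.get(r.source, 0) < TRUST["user_direct"]:
            continue   # refuse to feed tool-derived or inferred memories into high-stakes actions
        allowed.append(r)
    return allowed
```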

Non-adversarial: drift and stale memory. The Tianpan blog post on memory reconciliation captured the emerging 2026 discipline: "Memory useful today is stale tomorrow." The Adaline Labs piece labeled the failure mode "confidently wrong" — high-relevance retrieval combined with incorrect information is worse than irrelevance because the confidence does not signal uncertainty. Both pieces point at the same operational answer: the agent should know the freshness of every memory it reads, at minimum "checked X minutes ago" or "last reconciled against source at version N." For load-bearing claims, the agent should be able to (and prompted to) re-read the source rather than the memory. Stale memory is a specific application-level instance of the broader context-rot phenomenon.

The mitigations form a stack. Versioned reads with watermarks — every memory entry carries the version of the source-of-truth it was derived from, and the read either triggers a refresh or annotates "may be stale" if the entity has advanced past the watermark. Lazy revalidation at read time — eager invalidation is wasteful because most stored facts are never read again; the freshness check happens only when the entry is fetched. Conflict-resolution evaluation as a CI competency — ICLR 2026 added conflict resolution as a first-class memory-benchmark task, and the world-state-perturbation eval forces the stale-versus-fresh distinction into the release pipeline. The world-state-perturbation eval, in its simplest form, runs a session where the agent learns and stores a fact, mutates the entity in the source-of-truth between sessions, runs a new session that requires the fact, and scores whether the agent retrieves the now-stale memory and uses it confidently, retrieves it but flags freshness, refuses to use it, or correctly re-reads the source. Most production agents land in one of the first three quadrants and never reach the fourth.
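
A sketch of the watermarked read with lazy revalidation; the source-of-truth interface (a current-version lookup and a re-read function) is an assumed stand-in for whatever system of record the agent reconciles against.

```python
# Sketch of versioned reads with watermarks: freshness is checked only at read time.
from dataclasses import dataclass
from typing import Callable

@dataclass
class VersionedMemory:
    entity_id: str
    value: str
    watermark: int   # version of the source-of-truth this entry was derived from

def read(memory: VersionedMemory,
         current_version: Callable[[str], int],
         reread_source: Callable[[str], str],
         load_bearing: bool = False) -> tuple[str, str]:
    latest = current_version(memory.entity_id)
    if latest == memory.watermark:
        return memory.value, "fresh"
    if load_bearing:
        # For load-bearing claims, re-read the source rather than trusting the memory.
        fresh_value = reread_source(memory.entity_id)
        memory.value, memory.watermark = fresh_value, latest
        return fresh_value, "refreshed"
    return memory.value, f"may be stale (watermark {memory.watermark}, source at {latest})"
```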

Cross-context contamination is the multi-tenant failure. Context from one user, customer, repository, or workspace leaks into another. In a single-agent system, this is a privacy violation; in a multi-agent system, the Adaline Labs piece described a more subtle failure — one agent's inference gets treated as ground truth by another agent downstream. The fix is namespace isolation by design: every memory tagged with the relevant identifier (user, tenant, project), every retrieval scoped by that identifier, and every multi-agent communication carrying explicit provenance. Without the identifier-as-first-class-citizen architecture, cross-context contamination is not a possibility — it is a certainty as soon as the system scales past one tenant.

Compaction and eviction failures round out the non-adversarial set. Spring AI's session-management documentation covers the canonical case: naive truncation silently discards tool-call sequences mid-exchange, leaving the model with orphaned results and broken turn structure. The fix is turn-safe compaction (every strategy snaps to turn boundaries; the model never sees an orphaned tool call). Compaction is itself a form of selective memory; the system decides what to summarize, what to drop, and what to preserve across a session boundary. That decision needs evaluation. A compaction step that summarizes incorrectly or drops the wrong operational state corrupts the agent's working context without surfacing any visible error. The Nexumo guide enumerates 12 eviction strategies — typed expiry, recency-weighted, confidence-weighted, source-aware, contradiction-triggered, role-based partitions, event-driven invalidation, budget-based with diversity guards — and the production pattern is to use several in combination rather than relying on a single strategy.
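
A sketch of turn-safe truncation, the simplest compaction strategy that respects the constraint above; the message shape loosely follows OpenAI-style transcripts and is illustrative.

```python
# Sketch of turn-safe truncation: cuts only at user-turn boundaries, so an
# assistant tool call is never separated from its tool result.
def turn_boundaries(messages: list[dict]) -> list[int]:
    """Indices where a new user turn begins: the only safe points to cut history."""
    return [i for i, m in enumerate(messages) if m["role"] == "user"]

def compact(messages: list[dict], budget_msgs: int) -> list[dict]:
    if len(messages) <= budget_msgs:
        return messages
    boundaries = turn_boundaries(messages)
    if not boundaries:
        return messages[-budget_msgs:]          # degenerate case: no user turns to anchor on
    # Keep the most recent turns that fit the budget, starting at a turn boundary.
    for start in boundaries:
        if len(messages) - start <= budget_msgs:
            return messages[start:]
    return messages[boundaries[-1]:]            # always keep at least the last complete turn
```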

#Part V: Architecture Patterns

The 2026 survey paper at arXiv 2603.07670 organized production architectures into three patterns. The taxonomy is operational and practitioner-focused; the boundaries are not perfectly clean, but the three patterns capture where the field has consolidated.

Pattern A: context-only. Working memory in the context window; no external store. This works for short, single-session tasks and for many chatbot use cases. It fails immediately at long-horizon and multi-session use. The 1M-token context windows shipped in April 2026 stretch the regime where Pattern A works but do not eliminate the need for Patterns B and C — context-only systems have no persistence, no recall after restart, and quadratic attention cost at scale.

Pattern B: context plus retrieval store. Working memory in the context window; long-term records in an external vector or structured store; a retrieval pipeline injects relevant records each step. This is the workhorse pattern behind most production agents in 2026 — coding assistants, customer-service bots, enterprise copilots. The engineering burden is manageable: pick a vector database, write an embedding pipeline, instrument retrieval. The main challenge is retrieval quality. Most teams running Pattern B in production end up with the LangMem-style configuration: hot-path memory tools the agent calls explicitly when it needs to remember or look something up, plus a background memory manager that processes session transcripts asynchronously to extract durable facts.

Pattern C: tiered memory with learned control. Multiple tiers — context, structured database, vector store, cold archive — managed by a learned or prompted controller. MemGPT and AgeMem (arXiv 2601.01885) are the exemplars. AgeMem is particularly instructive: it treats memory operations (store, retrieve, update, summarize, discard) as policy actions optimized end-to-end via a three-stage progressive reinforcement learning pipeline with step-wise GRPO. The learned controller discovers non-obvious strategies — preemptive summarization before the context fills, source-weighted eviction, anticipatory retrieval — that hand-coded heuristics struggle to replicate. The cost is training compute and operational complexity. Pattern C is where the research frontier is in 2026; Pattern B is where production volume sits.

The reference five-layer architecture, drawing on the Chaitanya Prabuddha guide, fits cleanly into Pattern B. Layer 1: working memory (context window managed by a turn-safe compaction strategy; current conversation, retrieved tool results, in-flight reasoning). Layer 2: short-term episodic (last N turns, fast in-memory or Redis-backed cache). Layer 3: long-term episodic (event log with dual time + embedding indices; salience-scored; periodic compaction into session summaries). Layer 4: semantic facts (structured knowledge graph or upsert-friendly fact store; user preferences, organizational rules, project conventions). Layer 5: long-term vector index (Qdrant, Weaviate, or pgvector backing the embedding-based retrieval across all of Layers 3 and 4). Procedural memory — system prompt and tool definitions — sits orthogonally in version-controlled configuration rather than in any of these layers.

Session compaction is mandatory for production deployments. At session end, generate a 3-5 sentence session summary as a high-salience synthetic event and archive the individual turn events. One year of daily sessions becomes 365 searchable summaries rather than 50,000 individual turn events. Time-decay deletion: events older than 90 days with salience below 0.4 are soft-deleted in a nightly compaction pass. High-salience events — user corrections, stated preferences, notable errors, declared goals — are retained indefinitely. Together, salience-based storage, session compaction, and time-decay deletion maintain a bounded-size, high-signal memory store regardless of deployment duration.
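
A sketch of the two maintenance jobs described above, with the store and summarizer left as assumed interfaces; the 90-day and 0.4-salience thresholds follow the text and should be tuned per deployment.

```python
# Sketch of session compaction plus nightly time-decay deletion. The `store` and
# `summarize` interfaces are assumptions, standing in for the team's own memory layer.
from datetime import datetime, timedelta

def end_of_session_compaction(store, session_events, summarize) -> None:
    summary_text = summarize(session_events)                 # 3-5 sentence LLM-written summary
    store.append_event(text=summary_text, salience=0.9,      # high-salience synthetic event
                       kind="session_summary", timestamp=datetime.utcnow())
    store.archive(session_events)                            # raw turn events move to cold storage

def nightly_time_decay(store, now: datetime | None = None) -> None:
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=90)
    for event in store.all_events():
        if event.kind == "session_summary":
            continue                                          # summaries are retained indefinitely
        if event.timestamp < cutoff and event.salience < 0.4:
            store.soft_delete(event)                          # recoverable soft delete, not a purge
```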

Token budget allocation in production. The CallSphere guide documents the convergent pattern: 60% short-term (current conversation), 25% long-term (knowledge), 15% episodic (examples). Short-term memory always takes precedence — if the user said "I now prefer TypeScript" in the current conversation, that overrides a long-term memory saying "User prefers Python," and after the conversation the new preference is stored, replacing or annotating the old one. The token-budget decision is made by the memory orchestration layer, not the LLM, so the budget is deterministic and inspectable.
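
A sketch of the deterministic split applied by the orchestration layer before the prompt is assembled; the 60/25/15 percentages follow the CallSphere figures quoted above, and count_tokens stands in for the team's tokenizer.

```python
# Sketch of a deterministic token-budget allocation across the three memory sections.
def allocate_budget(total_tokens: int) -> dict[str, int]:
    return {
        "short_term": int(total_tokens * 0.60),   # current conversation: always takes precedence
        "long_term": int(total_tokens * 0.25),    # consolidated knowledge and facts
        "episodic": int(total_tokens * 0.15),     # retrieved past examples
    }

def pack_context(sections: dict[str, list[str]], total_tokens: int, count_tokens) -> list[str]:
    budget = allocate_budget(total_tokens)
    packed: list[str] = []
    for name in ("short_term", "long_term", "episodic"):
        used = 0
        for chunk in sections.get(name, []):       # chunks are assumed pre-ranked by relevance
            cost = count_tokens(chunk)
            if used + cost > budget[name]:
                break                               # deterministic, inspectable cutoff
            packed.append(chunk)
            used += cost
    return packed

print(allocate_budget(8000))   # {'short_term': 4800, 'long_term': 2000, 'episodic': 1200}
```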

Eviction strategies. Ship with at least four in combination. Typed expiry sets aggressive TTLs on volatile state (session goals, temporary tool results, UI state) and longer TTLs on durable preferences. Recency-weighted eviction combines age with access patterns — entries not used, cited, or refreshed recently decay faster. Source-aware eviction treats database tool outputs higher than model summaries higher than inferred user intent. Contradiction-triggered eviction downgrades or invalidates older entries when a fresh disagreeing signal arrives, rather than letting both coexist as peers. Above some size threshold, budget-based eviction with diversity guards prevents memory monoculture — a long stream of recent tool outputs should not crowd out the stable user preferences that actually matter.
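
One way to run several of these strategies together is a single retention score that multiplies per-strategy factors, with entries below a threshold becoming eviction candidates; the weights, half-life, and source ordering below are illustrative and should be set by evaluation rather than intuition.

```python
# Sketch of a combined retention score: typed expiry, recency weighting,
# source-aware weighting, and contradiction-triggered downgrade in one number.
from datetime import datetime

SOURCE_WEIGHT = {"tool_db": 1.0, "model_summary": 0.7, "inferred_intent": 0.4}

def retention_score(entry: dict, now: datetime | None = None) -> float:
    now = now or datetime.utcnow()
    age_days = (now - entry["created_at"]).days
    idle_days = (now - entry["last_accessed"]).days
    ttl_ok = 1.0 if age_days < entry["ttl_days"] else 0.0        # typed expiry on volatile state
    recency = 0.5 ** (idle_days / 30)                            # unused entries decay (30-day half-life)
    source = SOURCE_WEIGHT.get(entry["source"], 0.5)             # DB output > summary > inferred intent
    contradiction = 0.2 if entry.get("contradicted") else 1.0    # fresh disagreeing signal downgrades
    return ttl_ok * recency * source * contradiction * entry["salience"]

def eviction_candidates(entries: list[dict], threshold: float = 0.1) -> list[dict]:
    return [e for e in entries if retention_score(e) < threshold]
```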

#Part VI: A 90-Day Implementation Playbook

Most teams reading this paper have at least one production agent that already has some form of memory — usually whatever LangChain, OpenAI, or Anthropic ships by default — and have not yet built deliberate memory infrastructure. This section is the ninety-day roadmap to graduate one agent to production-grade memory.

#Days 1-30: triage and scope

Inventory every agent. For each, document: (1) what does it currently remember and how (chat history, embedded vector store, no memory at all)? (2) which of the four memory types does it use, even informally? (3) what data is in scope — user preferences, project conventions, customer state, tool experiences, organizational policies? (4) what is the blast radius of memory failure — does a stale or poisoned memory entry cause incorrect customer responses, wasted engineering hours, regulatory exposure, or safety incidents?

Pick exactly one agent for the first hardening pass. The Adaline Labs piece argues for picking the agent where memory is most clearly a product-decision surface — a customer-support assistant rather than a one-off coding tool, an internal copilot rather than a transient automation. Identify which of the four scopes — user, task, project, operational — apply, and write the four-property governance rule for each: owner, scope, expiry, deletion path. An entry without all four is not a memory; it is a leak.

Stand up reconciliation traces in CI. Pick five entities the agent will remember (user preferences, customer accounts, project conventions, tool configurations, organizational policies), record sessions where the agent learns each one, mutate the underlying source-of-truth between sessions, and replay the recorded sessions in CI. Score each replay on the world-state-perturbation rubric: does the agent use stale memory confidently (fail), retrieve but flag freshness (partial pass), refuse to answer (better partial pass), or re-read the source (full pass)? The pass rate on reconciliation traces is one of the few numbers that correlates with whether your agent will hold up past its first month in production.
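
A sketch of how the replay scoring and CI gate can be wired; the agent, answer, and threshold here are assumptions standing in for whatever replay harness and pass bar the team already uses.

```python
# Sketch of scoring a world-state-perturbation replay on the four-quadrant rubric.
def score_replay(agent, question: str, stale_value: str, fresh_value: str) -> str:
    answer = agent.run(question)                 # replayed session after the source mutated
    if fresh_value in answer.text and answer.reread_source:
        return "full_pass"                       # re-read the source of truth
    if answer.refused:
        return "partial_pass_refused"            # declined to answer rather than guess
    if stale_value in answer.text and answer.flags_freshness:
        return "partial_pass_flagged"            # used stale memory but flagged its freshness
    return "fail"                                # used stale memory confidently

def reconciliation_gate(scores: list[str], min_full_pass_rate: float = 0.6) -> bool:
    rate = scores.count("full_pass") / max(len(scores), 1)
    return rate >= min_full_pass_rate            # CI fails the build below the threshold
```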

#Days 31-60: deploy framework + storage + observability

Pick the framework. The matrix from Part II — Mem0 for token efficiency, Zep for temporal reasoning, Letta for stateful runtime, LangMem for LangChain integration — narrows the decision. Pick the storage. The matrix from Part III — Qdrant for production default, Weaviate for tool registries / 50+ agents, Pinecone for zero-ops, pgvector for under-10M-on-Postgres — narrows that decision too.

Wire memory operations into your observability stack (the stack documented in the Agent Observability Stack paper). Every memory write, every memory read, every retrieval call, every compaction pass must be traced. The traces let you debug retrieval failures, audit memory poisoning, measure freshness lag, and answer the question "why did the agent say that?" by reading the memory entries that surfaced for the relevant turn. Without observability on memory operations, you cannot debug memory failures — and you cannot pass the SOC 2 / ISO 42001 audit that 2027 procurement reviews will demand.

Implement the four-property governance API. Every memory entry is created via a function that accepts owner, scope, expiry, and deletion path as required parameters; missing any of the four is a build error. Implement forget_user(user_id) (and forget_tenant, forget_project) that deletes from the vector store, the structured store, the cache, and the trace log. Audit the deletion path quarterly — manually verify, for a randomly chosen user, that the deletion path actually removes everything.
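
A sketch of the governance API shape; all names are illustrative, and keyword-only required parameters plus a fan-out deletion path are one way to enforce the four-property rule in code.

```python
# Sketch of a four-property memory write and a fan-out forget path.
from datetime import datetime

def write_memory(stores: dict, text: str, *, owner: str, scope: str,
                 expiry: datetime, deletion_path: str) -> None:
    # Keyword-only parameters: omitting owner, scope, expiry, or deletion_path fails
    # at the call site, the runtime analogue of "missing any of the four is a build error".
    entry = {
        "text": text,
        "owner": owner,                      # e.g. user id or tenant id
        "scope": scope,                      # "user" | "task" | "project" | "operational"
        "expiry": expiry.isoformat(),
        "deletion_path": deletion_path,      # the named procedure that removes this entry
    }
    stores["vector"].insert(entry)
    stores["structured"].insert(entry)
    stores["trace_log"].record("memory_write", entry)

def forget_user(stores: dict, user_id: str) -> None:
    # Deletion must reach every store the entry ever touched; audit this path quarterly.
    for name in ("vector", "structured", "cache", "trace_log"):
        stores[name].delete_where(owner=user_id)
```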

#Days 61-90: harden, red-team, ship

Run the memory-poisoning red-team suite. The InjecMEM, Zombie Agent, and MemoryGraft attacks have public reproduction code. Run the suite against your hardened agent and confirm that (a) single-interaction poisoning attempts fail, (b) tool-side poisoning attempts are caught by tool-response scanning (per the Indirect Prompt Injection Defense paper), and (c) MemoryGraft-style experience replay does not surface attacker-controlled content even when semantically similar.

Stand up the runbook. Who owns memory incidents — the same on-call team that owns model incidents, or a separate group? Who notifies regulators when a forget_user request was served but a stale entry persisted in a backup? What is the disclosure timeline if a memory-poisoning attack is detected? Most teams discover, in the runbook tabletop, that they have not assigned ownership for memory incidents. The fix is to write the runbook before the incident.

Pick the next agent. Repeat. By the third agent, the memory infrastructure is shared platform code rather than per-agent custom work, and the marginal cost of adding memory to a new agent drops by an order of magnitude.

#Part VII: Where This Goes (2027 and beyond)

File-system memory will become a primitive. Anthropic's Claude Opus 4.7 launch in April 2026 — with a 1M-token context window and file-system-based memory as the primary new capability — signals the direction. By 2027, agents will read and write to a structured workspace (SOUL.md, USER.md, project-specific markdown files) the same way a human collaborator would, and the memory framework's job will be partly to manage that workspace. Letta's Context Repositories (git-backed memory with versioning) point at the same future; CaMeL's restricted-Python control flow points at it from a different angle.

Learned memory policies will graduate from research to production. AgeMem-style training of memory operations as policy actions, with reinforcement-learning rewards over end-task performance, will land in mainstream frameworks within eighteen months. The cost-benefit will favor adoption: the operational complexity of training a memory policy is real, but the gains in retrieval quality and token efficiency are large enough that platform vendors will productize it.

Memory becomes a product surface. The Adaline Labs framing — memory is not a chat-history feature, it is a product decision — will become the default vocabulary. Product managers will own memory the way they own search ranking or recommendation feeds today, with explicit governance, transparent UX (Anthropic's "complete transparency" model), and user-facing controls for editing, exporting, and deleting.

The benchmark suite will mature. ICLR 2026 added conflict resolution as a first-class memory-benchmark task. The next two cycles will add adversarial-poisoning evaluations, multi-tenant isolation tests, freshness-aware retrieval benchmarks, and cross-language consolidation. By 2027, "passes the standard memory benchmark suite" will be a procurement bar the way "passes SOC 2" is today.

The compliance layer is forming. GDPR, CCPA, EU AI Act, and ISO/IEC 42001 collectively imply that production memory layers must support per-user deletion, audit trails of what was remembered and when, transparent disclosure to users, and explicit consent flows. The teams who built the four-property governance API in 2026 will pass the audit; the teams who relied on whatever the framework defaulted to will not.

The honest framing for 2027 is the same as Part I's. The four memory types are the right taxonomy. The frameworks and vector databases will continue to commodify. The frontier — multi-session reasoning, event ordering, contradiction resolution at production-volume context windows — is hard, and BEAM-10M scores tell the story: the field is at 48.6 today, with the gap concentrated exactly where production agents actually need to be reliable.

#Closing

Memory is the third infrastructure pillar after MCP and observability. The four-type taxonomy is settled. Four serious frameworks are in production. The vector-database market has a recommended default and a clear sizing curve. The failure modes — adversarial poisoning, drift, contamination, compaction errors — are documented and have working mitigations.

The single ask of the reader is the one that closes every paper in this canon. Pick one production agent. Spend ninety days deploying the patterns documented in Part VI. Then pick the next.

The work compounds. The first agent takes ninety days because most of the time is platform investment — the framework, the vector database, the observability hookup, the governance API, the red-team suite, the runbook. The second agent takes thirty days because the platform is built. By the fifth agent, adding production-grade memory is the kind of thing a senior engineer does in a sprint. That is the trajectory worth investing in.

#References

  1. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory — arXiv 2504.19413 — https://arxiv.org/abs/2504.19413
  2. Introducing The Token-Efficient Memory Algorithm — Mem0 (April 2026) — https://mem0.ai/blog/mem0-the-token-efficient-memory-algorithm
  3. Memory Evaluation — Mem0 docs — https://docs.mem0.ai/core-concepts/memory-evaluation
  4. mem0ai/mem0 — GitHub README — https://github.com/mem0ai/mem0
  5. Zep: A Temporal Knowledge Graph Architecture for Agent Memory — arXiv 2501.13956 — https://arxiv.org/abs/2501.13956
  6. Zep Is The New State of the Art In Agent Memory — getzep blog (Jan 22 2025) — https://blog.getzep.com/state-of-the-art-agent-memory/
  7. Zep — Temporal Knowledge Graph blog post — https://blog.getzep.com/zep-a-temporal-knowledge-graph-architecture-for-agent-memory/
  8. Zep arXiv mirror — https://zep.link/sota-paper
  9. Announcing Letta — letta.com — https://www.letta.com/blog/announcing-letta
  10. Letta: The Stateful Agent Runtime — SudoAll (March 5 2026) — https://sudoall.com/letta-stateful-agents-nodejs/
  11. MemGPT — UC Berkeley Sky Computing Lab project page — https://sky.cs.berkeley.edu/project/memgpt/
  12. Introduction to Stateful Agents — Letta docs — https://docs.letta.com/guides/agents/memory
  13. Long-term memory — LangChain docs — https://docs.langchain.com/oss/python/langchain/long-term-memory
  14. LangMem Introduction — langchain-ai.github.io/langmem — https://langchain-ai.github.io/langmem
  15. How to Extract Episodic Memories — LangMem — https://langchain-ai.github.io/langmem/guides/extract_episodic_memories/
  16. LangMem Memory Tools docs — langchain-ai/langmem GitHub — https://github.com/langchain-ai/langmem
  17. Memory and new controls for ChatGPT — OpenAI blog — https://openai.com/blog/memory-and-new-controls-for-chatgpt
  18. Memory FAQ — OpenAI Help Center — https://help.openai.com/en/articles/10303002-how-does-memory-use-pastconversations
  19. Anthropic Tries to Win Users from ChatGPT With Memory Feature — Bloomberg Law (March 3 2026) — https://news.bloomberglaw.com/artificial-intelligence/anthropic-tries-to-win-users-from-chatgpt-with-memory-feature
  20. Anthropic's Claude catches up to ChatGPT and Gemini with upgraded memory features — The Verge (Oct 23 2025) — https://www.theverge.com/news/804124/anthropic-claude-ai-memory-upgrade-all-subscribersa
  21. Agent Security Bench (ASB) memory-poisoning baseline — ICLR 2025 — https://iclr.cc/virtual/2025/poster/29432
  22. InjecMEM: Memory Injection Attack — OpenReview (Sept 25 2025) — https://openreview.net/pdf/5f0aedb32147ba6ec36d85f54a53ad381dc22324.pdf
  23. Zombie Agent — arXiv 2602.15654 — https://arxiv.org/abs/2602.15654
  24. MemoryGraft: Persistent Compromise via Poisoned Experience Retrieval — arXiv 2512.16962 — https://arxiv.org/pdf/2512.16962
  25. Agent Memory Drift: Why Reconciliation Is the Loop You're Missing — Tianpan (Apr 27 2026) — https://tianpan.co/blog/2026-04-27-agent-memory-reconciliation-drift-loop
  26. AI Agent Memory Design For Production Agents — Adaline Labs (May 2 2026) — https://labs.adaline.ai/p/agent-memory-is-a-product-surface
  27. When Agent Memory Learns to Forget — Nexumo via Medium (Mar 15 2026) — https://medium.com/@Nexumo_/when-agent-memory-learns-to-forget-21fb08a88513
  28. Spring AI Session — Event-Sourced Short-Term Memory with Context Compaction (Apr 15 2026) — https://spring.io/blog/2026/04/15/spring-ai-session-management
  29. Choosing A Vector DB For Multi-Agent Systems 2026 (Benchmarked) — RankSquire (Mar 20 2026) — https://ranksquire.com/2026/03/20/choosing-a-vector-db-for-multi-agent-systems-2026/
  30. Vector Database Performance Benchmarks 2026 — AICited Technical Journal — https://www.aicited.org/technical-journals/technical-journal-vector-database-performance-benchmarks-comprehensive-evaluation-for-production-rag-systems-in-2026
  31. Pinecone vs Weaviate vs Qdrant vs Chroma: Benchmarks at 1M-100M Vectors (2026) — Leaper — https://leaper.dev/blog/vector-databases-compared-2026.html
  32. Pinecone vs Weaviate vs Qdrant vs Chroma: Honest Comparison (2026) — Inventiple (Apr 18 2026) — https://www.inventiple.com/blog/pinecone-vs-weaviate-vs-qdrant-vs-chroma
  33. Memory for Autonomous LLM Agents: Mechanisms, Evaluation, Emerging Frontiers — arXiv 2603.07670 — https://arxiv.org/abs/2603.07670v1
  34. AI Agent Memory Architectures: Episodic, Semantic, and Working Memory in Production — Chaitanya Prabuddha (Mar 29 2026) — https://chaitanyaprabuddha.com/blog/ai-agent-memory-architectures
  35. Agentic Memory (AgeMem): Learning Unified LTM and STM Management — arXiv 2601.01885 — https://arxiv.org/abs/2601.01885v1
  36. Agent Memory Systems: Short-Term, Long-Term, Episodic — CallSphere — https://callsphere.ai/blog/agent-memory-systems-short-term-long-term-episodic-memory-ai-2026.md