#Executive summary
Long-context LLMs do not solve the long-term memory problem. That is the empirical finding of BEAM (Benchmarking and Enhancing Long-Term Memory in LLMs), Tavakoli et al.'s ICLR 2026 benchmark released as arXiv 2510.27246 in October 2025.[1][2][3] Even models with 1M-token context windows — with and without retrieval-augmentation — struggle as conversations lengthen. The companion contribution, the LIGHT framework, equips LLMs with three complementary memory systems inspired by human cognition (episodic memory, working memory, scratchpad) and consistently delivers a 3.5%–12.69% average improvement over the strongest long-context and RAG baselines.[1][4][5]
The headline numbers are dramatic at extreme context lengths.[6] At 100K tokens: +49.1%[6] for Llama-4-Maverick and +44.3%[6] for GPT-4.1-nano over long-context baselines. At 1M tokens: +75.9%[6] for GPT-4.1-nano and +60.1%[6] for Qwen2.5-32B. At 10M tokens — where no baseline natively supports the full context — LIGHT achieves +155.7%[6] for Llama-4-Maverick, +107.3%[6] for GPT-4.1-nano, and +57.3%[6] for Gemini-2.0-flash (the only model where LIGHT slightly trails RAG in some configurations, attributed to model-specific retrieval behavior).[6]
The benchmark itself is the longer-lived contribution. BEAM contains 100 conversations of 128K, 500K, 1M, and 10M tokens spanning general, coding, and math domains, with 2,000 human-validated probing questions across 10 memory dimensions: abstention, contradiction resolution, event ordering, information extraction, instruction following, knowledge update, multi-session reasoning, preference following, summarization, and temporal reasoning.[7] The dataset ships on HuggingFace as Mohammadta/BEAM (128K, 500K, 1M chats) and Mohammadta/BEAM-10M (10M-token chats), with code at github.com/mohammadtavakoli78/BEAM.[8][7]
This paper unpacks BEAM and LIGHT for production teams. It catalogs the 10 memory dimensions, walks through the LIGHT three-memory architecture with concrete model choices and token thresholds, compares BEAM to the existing long-context memory benchmark cohort (LongMemEval, LoCoMo, MemoryBench, RULER, Zep eval suite, Letta/MemGPT evals), and maps the empirical findings into the production memory-architecture decision (Letta, Mem0, Zep, LangMem, LightMem).
#What BEAM measures
The benchmark's structural innovation is multi-domain, multi-scale, narratively-coherent generation. Existing long-context benchmarks (LongMemEval at ~115K tokens per session; LoCoMo at 9K tokens average over up to 35 sessions) tested simple recall on relatively short conversations.[9][10] BEAM extends context length by roughly an order of magnitude and broadens memory-ability coverage to 10 distinct dimensions.
#The data generation pipeline
The pipeline generates conversations through recursive plan decomposition.[4] First, a high-level conversation plan defines the narrative for a domain plus a simulated user with generated attributes. The plan recursively decomposes into finer sub-plans that specify storyline progression. Sub-plans generate chronologically ordered user turns, expanded with assistant responses. A follow-up detection module decides when the user would naturally pose a clarifying question; if triggered, it emits an additional user query. The pipeline finally generates probing questions targeting 10 distinct memory dimensions, validated by human annotators for quality.[4][11]
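A minimal sketch of the recursive decomposition and follow-up detection steps described above, assuming a generic llm() text-completion callable; the function names and prompt wording are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of BEAM-style recursive plan decomposition and follow-up detection.
# `llm` is assumed to be any text-completion callable; prompts and names are illustrative.
from typing import Callable, List

def decompose_plan(llm: Callable[[str], str], plan: str, depth: int = 0, max_depth: int = 2) -> List[str]:
    """Recursively split a high-level conversation plan into finer, chronologically ordered sub-plans."""
    if depth >= max_depth:
        return [plan]
    raw = llm(f"Split this conversation plan into 3-5 chronologically ordered sub-plans:\n{plan}")
    sub_plans = [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
    leaves: List[str] = []
    for sub in sub_plans:
        leaves.extend(decompose_plan(llm, sub, depth + 1, max_depth))
    return leaves

def expand_sub_plan(llm: Callable[[str], str], sub_plan: str) -> List[dict]:
    """Turn one sub-plan into a user turn, an assistant response, and an optional follow-up."""
    user_turn = llm(f"Write the next user message advancing this storyline:\n{sub_plan}")
    assistant_turn = llm(f"Write the assistant's reply to:\n{user_turn}")
    turns = [{"role": "user", "content": user_turn},
             {"role": "assistant", "content": assistant_turn}]
    # Follow-up detection: ask the LLM whether a clarifying question is natural here.
    verdict = llm(f"Would the user naturally ask a clarifying follow-up here? Answer yes or no.\n{assistant_turn}")
    if verdict.strip().lower().startswith("yes"):
        turns.append({"role": "user", "content": llm(f"Write that follow-up question:\n{assistant_turn}")})
    return turns
```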
#The 10 memory dimensions
BEAM defines memory abilities as 10 independent question categories:[7]
- Abstention — does the model withhold answers when evidence is missing?
- Contradiction Resolution — can the model detect and reconcile inconsistent statements across widely separated turns?
- Event Ordering — can the model recognize and reconstruct the sequence of evolving information?
- Information Extraction — recall of entities and factual details in long histories.
- Instruction Following — sustained adherence to user-specified constraints over long contexts.
- Knowledge Update — revising stored facts as new ones appear.
- Multi-Session Reasoning — inference integrating evidence across multiple, non-adjacent dialogue segments.
- Preference Following — personalized responses adapting to evolving preferences.
- Summarization — abstracting and compressing dialogue content.
- Temporal Reasoning — reasoning about explicit and implicit time relations.
Each conversation in BEAM has multiple probing questions sampled across these dimensions; the 2,000 total questions provide statistical power for per-dimension and per-context-length analysis.[7][11]
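For downstream analysis, the dimension labels behave as a fixed categorical axis. A hedged sketch of per-dimension, per-context-length accuracy tabulation follows; the record field names are assumptions rather than the dataset's published schema.

```python
# Hypothetical per-dimension accuracy tabulation. The field names "dimension",
# "context_length", and "correct" are assumptions, not the published dataset schema.
from collections import defaultdict

DIMENSIONS = [
    "abstention", "contradiction_resolution", "event_ordering", "information_extraction",
    "instruction_following", "knowledge_update", "multi_session_reasoning",
    "preference_following", "summarization", "temporal_reasoning",
]

def per_dimension_accuracy(results: list[dict]) -> dict[tuple[str, str], float]:
    """results: one dict per graded probing question."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    hits: dict[tuple[str, str], int] = defaultdict(int)
    for r in results:
        key = (r["dimension"], r["context_length"])  # e.g. ("temporal_reasoning", "1M")
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {key: hits[key] / totals[key] for key in totals}
```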
#The four context lengths
BEAM ships at four context-length tiers — 128K, 500K, 1M, and 10M tokens.[8] The tier structure is deliberate: 128K is within most modern LLM context windows; 500K stresses long-context attention; 1M aligns with the largest commercial context windows (Gemini 2.5 Pro, GPT-4.1, Claude Sonnet); 10M is far beyond any production context window and forces memory-augmented architectures.[6] The 10M-token tier (Mohammadta/BEAM-10M on HuggingFace) is the genuinely novel scale; LongMemEval and LoCoMo top out below 200K tokens.[9][10]
#How LIGHT works
LIGHT (the cognitive-inspired memory framework that ships with BEAM) is a three-memory architecture. Inspired by Sridhar et al. 2023 and Binder & Desai 2011, it emulates human episodic memory, working memory, and the practice of taking notes on a scratchpad.[4][12]
#Component 1: Episodic memory
Episodic memory is a long-term index over the full conversation, used for retrieval.[4][7] The implementation indexes the entire dialogue with chunked embeddings; at inference time, given the user's query x, the system retrieves the top-k most relevant chunks E.[11] The retrieval policy is decoupled from the LLM backbone, so any embedding model can power episodic recall — the BEAM paper does not prescribe a specific embedding model, leaving it as a deployment-time choice.[4]
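A minimal sketch of such an episodic index, assuming sentence-transformers embeddings and naive whitespace chunking; the model name and chunk size are deployment-time assumptions, not values from the paper.

```python
# Minimal episodic-memory sketch: chunk the dialogue, embed the chunks, retrieve top-k by
# cosine similarity. The embedding model and chunk size here are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

class EpisodicMemory:
    def __init__(self, chunk_words: int = 400, model_name: str = "all-MiniLM-L6-v2"):
        self.encoder = SentenceTransformer(model_name)
        self.chunk_words = chunk_words
        self.chunks: list[str] = []
        self.vectors = None

    def index(self, dialogue_turns: list[str]) -> None:
        # Naive whitespace chunking; a production system would chunk by tokens.
        words = " ".join(dialogue_turns).split()
        self.chunks = [" ".join(words[i:i + self.chunk_words])
                       for i in range(0, len(words), self.chunk_words)]
        self.vectors = self.encoder.encode(self.chunks, normalize_embeddings=True)

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        q = self.encoder.encode([query], normalize_embeddings=True)[0]
        scores = self.vectors @ q                 # cosine similarity on normalized vectors
        return [self.chunks[i] for i in np.argsort(-scores)[:k]]
```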
#Component 2: Working memory
Working memory captures the most recent z user-assistant dialogue turns, ensuring that the LLM has continuous awareness of the immediate conversation context.[4][7] This is the simplest of the three primitives — a sliding window of the last z turns that always loads into context — but it ensures the model never loses recent state to retrieval miss.[11]
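The mechanism reduces to a bounded buffer of recent messages. A small sketch, with z as an illustrative default since the paper treats it as a hyperparameter.

```python
# Working memory as a sliding window over the last z user-assistant turn pairs.
from collections import deque

class WorkingMemory:
    def __init__(self, z: int = 10):
        self.turns = deque(maxlen=2 * z)   # z pairs = 2z messages

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def window(self) -> list[dict]:
        return list(self.turns)            # always loaded into context, never retrieved
```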
#Component 3: Scratchpad
The scratchpad is the most distinctive contribution. After each dialogue pair, LIGHT uses Qwen2.5-32B-AWQ to reason over the current and preceding turn and extract salient content as a structured note.[11][12] The scratchpad iteratively merges with earlier versions; once content exceeds a 30K-token threshold (substantially shorter than the raw conversation), it is compressed into a 15K-token summary by GPT-4.1-nano.[11] This process maintains efficiency and long-term coherence, analogous to the gradual abstraction of semantic memory in humans. Unlike the episodic index, the scratchpad is not stored in a vector database but is provided directly as contextual input during inference.[11]
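A hedged sketch of that extract-merge-compress loop follows; only the 30K/15K thresholds and the model roles come from the paper, while the helper callables, the token estimate, and the prompt wording are assumptions.

```python
# Scratchpad sketch: extract a note after each turn pair, merge it into the pad, and
# compress once the pad exceeds ~30K tokens down to ~15K. The llm_extract / llm_compress
# callables and prompts are assumptions; only the thresholds follow the paper.
from typing import Callable

COMPRESS_AT_TOKENS = 30_000
COMPRESS_TO_TOKENS = 15_000

def approx_tokens(text: str) -> int:
    return len(text) // 4                  # rough heuristic: ~4 characters per token

class Scratchpad:
    def __init__(self, llm_extract: Callable[[str], str], llm_compress: Callable[[str, int], str]):
        self.llm_extract = llm_extract     # the paper uses Qwen2.5-32B-AWQ for extraction
        self.llm_compress = llm_compress   # the paper uses GPT-4.1-nano for compression
        self.pad = ""

    def update(self, previous_turn: str, current_turn: str) -> None:
        note = self.llm_extract(f"Extract the salient facts from:\n{previous_turn}\n{current_turn}")
        self.pad = f"{self.pad}\n{note}".strip()        # iterative merge with earlier versions
        if approx_tokens(self.pad) > COMPRESS_AT_TOKENS:
            self.pad = self.llm_compress(self.pad, COMPRESS_TO_TOKENS)
```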
The scratchpad is the load-bearing component. The ablation study reports that removing the scratchpad consistently produces the largest performance drop across context lengths — larger than removing working memory or episodic retrieval alone.[4][11]
#Inference
At inference time, the LLM generates the answer y by conditioning on the question x, retrieved episodic content E, working memory W, and the accumulated scratchpad S_x: y = π(x, E, W, S_x).[11] The prompt template (Listing 44 in the paper's appendix H) asks the model to integrate all three sources before answering.[4] The architecture is model-agnostic — LIGHT runs on Llama-4-Maverick, GPT-4.1-nano, Qwen2.5-32B, and Gemini-2.0-flash backbones in the published experiments.[6]
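A sketch of that composition step; the template wording below is illustrative and is not the paper's Listing 44.

```python
# Inference-time composition: condition the backbone on the question x, retrieved
# episodic chunks E, the working-memory window W, and the scratchpad S.
# The prompt wording is illustrative, not the paper's actual template.
def answer(llm, question: str, episodic: list[str], working: list[dict], scratchpad: str) -> str:
    prompt = (
        "Use all three memory sources below to answer the question.\n\n"
        "## Retrieved episodic memory\n" + "\n---\n".join(episodic) + "\n\n"
        "## Recent turns (working memory)\n"
        + "\n".join(f"{t['role']}: {t['content']}" for t in working) + "\n\n"
        "## Scratchpad notes\n" + scratchpad + "\n\n"
        "## Question\n" + question
    )
    return llm(prompt)
```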
#The empirical findings
#Long-context LLMs degrade with conversation length
The headline empirical claim: LLM performance degrades substantially as conversation length grows, even when the conversation fits inside the context window.[4][11] This generalizes prior LongMemEval and LoCoMo findings to the 1M-10M token regime. At BEAM's 100K tier, all tested models (Llama-4-Maverick, GPT-4.1-nano, Qwen2.5-32B, Gemini-2.0-flash) can in principle attend to the full conversation, yet probing-question accuracy is materially below ceiling. At 1M tokens, only the largest commercial models can attend to the full context, and probing-question accuracy drops further. At 10M tokens, no model natively supports the full context.[6][11]
#LIGHT improvements grow with context length
The LIGHT improvements grow monotonically with context length:[6]
| Context length | Llama-4-Maverick | GPT-4.1-nano | Qwen2.5-32B | Gemini-2.0-flash | Source |
|---|---|---|---|---|---|
| 100K | +49.1% | +44.3% | — | — | [6] |
| 1M | — | +75.9% | +60.1% | — | [6] |
| 10M | +155.7% | +107.3% | — | +57.3% | [6] |
These deltas are improvements over the strongest long-context or RAG baseline at each context length. The 10M Gemini-2.0-flash result is the one exception where LIGHT slightly trails RAG in some configurations, attributed to model-specific retrieval behavior in Gemini's serving stack.[6]
The average improvement across all backbones and context lengths is 3.5%–12.69%, depending on the backbone LLM.[1][11] The dramatic 100%+ deltas at 10M tokens reflect not just LIGHT's strength but also the brittleness of long-context-only baselines: at 10M tokens the baselines truncate the conversation and lose access to most of the relevant context, while LIGHT's three-memory architecture retains compressed access to the full history.[6][11]
#Ablation: scratchpad is the load-bearing component
The ablation study isolates the contribution of each LIGHT component (episodic, working, scratchpad, noise filtering) at each context length.[11] At 100K, removing the scratchpad reduces performance by ~0.89%; removing noise filtering produces a similar drop. At longer contexts the scratchpad's contribution grows more pronounced — the gradual abstraction of dialogue content into compressed semantic notes is what allows LIGHT to scale to 10M tokens without exhausting the inference-time context budget.[11]
#How BEAM compares to the benchmark cohort
The companion analysis "Memory Benchmarks for Conversational Agents" surveys the existing memory-evaluation landscape and scores it on a 7-property continuity rubric (persistence, update handling, temporal ordering, disambiguation, reconstruction, model independence, operational usefulness). The median score across the surveyed benchmarks is 0.5/7; the mean is 0.43/7. No single benchmark covers more than 2 properties.[13]
Where each benchmark fits:[13]
- LongMemEval — 500 questions across 5 task categories spanning conversational histories in tens of thousands of tokens. Tests retrieval and reasoning over distributed facts. Property coverage 0/7 (it tests long-context recall, not memory continuity).
- BEAM — extends the context envelope past 1M tokens; benchmarks retrieval and light reasoning at scale. It shares LongMemEval's structural limitation: a context-window test, not a continuity test. Property coverage 0/7.
- LoCoMo — 50 dialogues averaging 9K tokens over up to 35 sessions, with persona grounding and temporal event graphs.[10] Tests question answering, event summarization, multi-modal dialogue. Property coverage 1.0/7.
- MemoryBench — Mem0's evaluation suite. Property coverage 0.5/7.
- RULER — fixed-format long-context probe suite (needle-in-a-haystack-style synthetic tests).
- Zep eval suite — graph-edge-correctness tests for graph-based memory. Property coverage 1.0/7.
- Letta / MemGPT evaluations — measure tool-call correctness on memory operations. Property coverage 0.5/7.
The decisive structural point about BEAM (and LongMemEval): a language model with a context window large enough to fit the full benchmark passes both with no memory architecture at all.[13] These benchmarks measure long-context capacity, not memory continuity, and so cannot distinguish a memory system from a long-context model. That is the right test when you are asking "can my model attend over a very long prompt?" and the wrong one when you are asking "does my agent maintain consistent state across days?"
#Implications for production memory architectures
The BEAM and LIGHT findings should be read as input to the production memory-architecture decision, not as a definitive ranking. Per the AgentMarketCap April 2026 vendor cohort analysis:[14]
- Letta (formerly MemGPT) — OS-inspired memory tiers (core/archival/recall), agents self-edit memory via function calls. Best for long-running conversational agents (weeks to months) and compliance-sensitive enterprise agents. Letta's Context-Bench measures agentic context engineering — whether models can strategically decide what context to retrieve — which is a different capability than BEAM's probing-question accuracy.[14]
- Mem0 — extraction-based memory with the largest production-deployment footprint. Best for interactive customer-facing agents and rapid prototyping. On LongMemEval with GPT-4o, Mem0 scores 49.0%;[14] on LoCoMo, Mem0 scores 68.5%.[14]
- Zep — temporal-graph-based memory with the strongest performance on entity-heavy temporal reasoning. On LongMemEval with GPT-4o, Zep scores 63.8%[14] (a 15-point gap over Mem0). Ingestion latency is a real constraint — immediate post-ingestion retrieval can fail; correct answers sometimes only surface hours later after background graph processing.[14]
- LangMem — LangGraph-native memory primitive. Best for LangGraph batch processing. Search latency p50 of ~17.99s[14] (vs Mem0's 0.148s[14]) makes LangMem unsuitable for real-time interactive agents.[14]
- LightMem — distinct from LIGHT (different paper, arXiv 2510.18866 by zjunlp). Three-stage memory inspired by Atkinson-Shiffrin: cognition-inspired sensory memory + topic-aware short-term + long-term with sleep-time update. Reports up to 10.9%[15] gains on LongMemEval with up to 117x token reduction,[15] 159x API-call reduction,[15] and 12x runtime improvement.[15][16]
The mapping from BEAM evidence to architecture choice:
- If your conversations stay below 500K tokens, long-context-only is competitive — none of the LIGHT/Letta/Mem0/Zep variants dominate at this scale, and the operational simplicity of context-only is real.
- If your conversations regularly exceed 1M tokens, structured memory becomes load-bearing. LIGHT's +75.9%[6] / +60.1%[6] deltas on GPT-4.1-nano and Qwen2.5-32B are not subtle; the production memory architectures (Letta, Mem0, Zep) materialize similar deltas in different shapes.
- If your conversations exceed 10M tokens, you cannot use long-context-only. The 10M-token tier of BEAM is the empirical anchor that this regime exists and that structured memory is the only path forward.
#Reproducing BEAM evaluations
The BEAM repository ships at github.com/mohammadtavakoli78/BEAM with code, evaluation scripts, and dataset access through HuggingFace.[7][8] The reproduction recipe:
- Pull the dataset: `Mohammadta/BEAM` (128K-1M chats) or `Mohammadta/BEAM-10M` (10M chats); see the loading sketch after this list.
- Pick a backbone LLM. The published results cover Llama-4-Maverick, GPT-4.1-nano, Qwen2.5-32B, and Gemini-2.0-flash; any LLM with a sufficient context window can be evaluated.
- Generate baselines: pure long-context (the LLM with the full conversation in context, where context window allows) and RAG (chunked embeddings with top-k retrieval).
- Implement LIGHT: episodic-memory chunking + embedding, working-memory window of last z turns, scratchpad construction with Qwen2.5-32B-AWQ extraction and GPT-4.1-nano compression at the 30K threshold.
- Run probing questions across all 10 memory dimensions and report per-dimension and per-context-length accuracy.
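A minimal loading sketch using the standard HuggingFace datasets API; the repository names come from the paper, but the split and record field names are assumptions to verify against the dataset card.

```python
# Hypothetical loading skeleton. Repo names come from the paper; split names and record
# schema below are assumptions -- inspect the dataset card before building an evaluator.
from datasets import load_dataset

beam = load_dataset("Mohammadta/BEAM")          # 128K / 500K / 1M tiers
beam_10m = load_dataset("Mohammadta/BEAM-10M")  # 10M-token tier

first_split = next(iter(beam))                  # e.g. "train"; actual split names may differ
print(beam[first_split][0].keys())              # inspect the record schema (conversation, probing questions, labels)
```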
The evaluation cost is non-trivial. At 10M tokens per conversation × 100 conversations × multiple backbones × multiple memory configurations, full reproduction is a 5-figure-USD inference budget. The HuggingFace dataset and the open-source code make partial reproduction (subset of conversations, subset of backbones) accessible to academic and small-team budgets.
#What this paper does not cover
It does not compare BEAM head-to-head with LongMemEval, LoCoMo, MemoryBench, or RULER on the same backbones — that is a multi-paper comparison that would require running each benchmark's evaluation script on the full backbone set. It does not benchmark the production memory vendors (Letta, Mem0, Zep, LangMem, LightMem) on BEAM directly — none of the vendors have published BEAM results as of May 2026; cross-vendor BEAM evaluations are an open research opportunity. It does not cover the multi-modal extension of BEAM (image + dialogue) — BEAM-10M includes 10 different conversation plans per 10M-token chat but does not test multi-modal recall.
It also does not cover the security and adversarial implications of long-context memory — prompt injection attacks against scratchpad-based architectures, or the reproducibility of LIGHT's compression step under adversarial inputs. Those are separate research directions.
#Implications for the agent-memory canon
BEAM is the benchmark that pushes the long-context memory question to its scaling limit. The 10M-token tier is the empirical anchor for the claim that long-context-only does not generalize past 1M tokens; the LIGHT framework is the cognitive-inspired prescription for what to do instead.[4][6] Production teams should read BEAM not as a definitive memory benchmark — its property coverage is 0/7 on the continuity rubric — but as the most rigorous existing test of the long-context-vs-structured-memory choice at extreme scale.
The architectural prescription is unambiguous. At 1M+ token conversations, three primitives matter: an episodic memory that can be queried by embedding similarity, a working memory window of recent turns, and a scratchpad that compresses dialogue content into retrievable semantic notes. Those three primitives map cleanly onto the production memory vendors: Letta's archival/recall + core memory implements episodic + working; Mem0's extraction implements scratchpad-equivalent compression; Zep's temporal graph implements episodic with structural typing. The benchmark suggests teams should evaluate their chosen vendor on BEAM-style probing questions to confirm the architecture's behavior at the context lengths the team will see in production.
The 3.5-12.69%[1] average improvement is the realistic per-team expectation; the 100%+[6] deltas at 10M tokens are the dramatic-but-narrow case that applies only when the alternative is a 1M-context model truncating to 1M of a 10M-token history. Production teams operating at 100K-500K conversational length will see the modest end of LIGHT's range; teams operating at multi-million-token scale will see the dramatic end. Either way, BEAM is the benchmark that lets the team measure rather than guess.
#References
1. Tavakoli, Salemi, Ye, Abdalla, Zamani, Mitchell. "Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs." arXiv 2510.27246, October 31, 2025. https://arxiv.org/abs/2510.27246
2. ICLR 2026 poster page for "Beyond a Million Tokens" with author list. https://iclr.cc/virtual/2026/poster/10006595
3. arXiv abstract page for 2510.27246. https://arxiv.org/abs/2510.27246
4. BEAM paper PDF v1 with LIGHT framework architecture description. https://arxiv.org/pdf/2510.27246
5. OpenReview submission page for BEAM with TL;DR claims. https://openreview.net/forum?id=y59hf5lrMn
6. BEAM paper PDF Section 4 main results table with per-context-length improvement deltas. https://www.arxiv.org/pdf/2510.27246
7. HuggingFace dataset card `Mohammadta/BEAM` with 128K/500K/1M conversation tiers and 10 question types. https://huggingface.co/datasets/Mohammadta/BEAM
8. HuggingFace dataset card `Mohammadta/BEAM-10M` with the 10M-token tier. https://huggingface.co/datasets/Mohammadta/BEAM-10M
9. GitHub repository mohammadtavakoli78/BEAM with code and evaluation scripts. https://github.com/mohammadtavakoli78/BEAM
10. Maharana et al. "Evaluating Very Long-Term Conversational Memory of LLM Agents" (LoCoMo). arXiv 2402.17753, 2024. https://arxiv.org/abs/2402.17753
11. BEAM paper Section 3.2, LIGHT scratchpad construction with Qwen2.5-32B-AWQ extraction and 30K-to-15K compression with GPT-4.1-nano. https://arxiv.org/pdf/2510.27246
12. OpenReview PDF v3 for BEAM with ablation study. https://openreview.net/pdf/70333b782ce6c570c5c01e8c503c71a89526c3d5.pdf
13. Memory-benchmark continuity-rubric companion paper analyzing LoCoMo, LongMemEval, BEAM, MemoryBench, Zep, Letta/MemGPT, and RULER. https://www.arxiv.org/pdf/2604.10981
14. AgentMarketCap. "Agent Long-Term Memory in 2026: Letta, Mem0, Zep, and LangMem Compared." 2026-04-08, with vendor benchmark data. https://agentmarketcap.ai/blog/2026/04/08/agent-long-term-memory-architecture-letta-memgpt-langmem-zep
15. Zhang et al. "LightMem: Lightweight and Efficient Memory-Augmented Generation." arXiv 2510.18866, October 21, 2025. https://arxiv.org/abs/2510.18866v1
16. zjunlp/LightMem GitHub repository with comprehensive baseline evaluation framework. https://github.com/zjunlp/LightMem/blob/main/README.md