MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
- URL: http://arxiv.org/abs/2602.16313v1
- Date: Wed, 18 Feb 2026 09:49:14 GMT
- Title: MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
- Authors: Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, Jiaxin Pei, Julian McAuley, Yejin Choi, Alex Pentland
- Abstract summary: Existing evaluations of agents with memory typically assess memorization and action in isolation. We introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning.
- Score: 55.145729491377374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing evaluations of agents with memory typically assess memorization and action in isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in single-session tasks without the need for long-term memory. However, in realistic settings, memorization and action are tightly coupled: agents acquire memory while interacting with the environment, and subsequently rely on that memory to solve future tasks. To capture this setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. The benchmark consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into memory, and subsequently use that memory to guide later actions to solve the overall task. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning, and reveals that agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting, exposing a gap in current evaluations for agents with memory.
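The multi-session Memory-Agent-Environment loop the abstract describes can be pictured with a minimal sketch. This is an illustration only, not MemoryArena's actual API: the names (`MemoryStore`, `agent_act`, `run_session`) and the toy string-based memory are assumptions made for the example, and the agent policy and environment feedback are stubbed out.

```python
# Minimal sketch of a multi-session Memory-Agent-Environment loop in the
# spirit of MemoryArena. All names here are hypothetical illustrations,
# not the benchmark's API.
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Toy long-term memory: a list of distilled experience strings."""
    entries: list = field(default_factory=list)

    def write(self, experience: str) -> None:
        self.entries.append(experience)

    def read(self) -> str:
        return "\n".join(self.entries)

def agent_act(observation: str, memory: str) -> str:
    """Stub policy; a real agent would condition an LLM on both inputs."""
    return f"action informed by [{memory or 'no memory'}] given [{observation}]"

def run_session(subtask: str, memory: MemoryStore) -> str:
    """One session: act using current memory, then distill the
    environment's feedback back into memory for later sessions."""
    action = agent_act(subtask, memory.read())
    feedback = f"feedback on {action!r}"      # environment response (stubbed)
    memory.write(f"{subtask}: {feedback}")    # distill experience
    return action

# Interdependent subtasks: later sessions can only succeed if experience
# from earlier sessions was retained and reused.
subtasks = [
    "navigate site and record pricing rules",
    "plan itinerary respecting recorded pricing rules",
    "answer final query using both prior findings",
]
memory = MemoryStore()
for t in subtasks:
    print(run_session(t, memory))
```

The point of the interdependence is visible in the loop: an agent that discards memory between sessions (equivalent to passing an empty `MemoryStore` each time) cannot solve the later subtasks, which is the failure mode the benchmark is designed to expose.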
Related papers
- Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation [64.69535903624033]
We propose an Adversarial Memory Adaptation mechanism (AMA) that aligns memory construction and update with task objectives by simulating task execution. AMA can be integrated into various existing memory systems, and extensive experiments on the long-dialogue benchmark LoCoMo demonstrate its effectiveness.
arXiv Detail & Related papers (2026-01-29T14:42:34Z) - Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents [20.357475946040054]
We introduce Mem2ActBench, a benchmark for evaluating whether agents can proactively leverage long-term memory to execute tool-based actions. A reverse-generation method produces 400 tool-use tasks, with human evaluation confirming 91.3% are strongly memory-dependent.
arXiv Detail & Related papers (2026-01-13T06:22:32Z) - Memory in the Age of AI Agents [217.9368190980982]
This work aims to provide an up-to-date landscape of current agent memory research. We identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks.
arXiv Detail & Related papers (2025-12-15T17:22:34Z) - Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory [89.65731902036669]
Evo-Memory is a streaming benchmark and framework for evaluating self-evolving memory in large language model (LLM) agents. We evaluate over ten representative memory modules across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets.
arXiv Detail & Related papers (2025-11-25T21:08:07Z) - Evaluating Long-Term Memory for Long-Context Question Answering [100.1267054069757]
We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy.
arXiv Detail & Related papers (2025-10-27T18:03:50Z) - MemGen: Weaving Generative Latent Memory for Self-Evolving Agents [57.1835920227202]
We propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition.
arXiv Detail & Related papers (2025-09-29T12:33:13Z) - Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions [22.190297901876278]
We identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long-context settings like book-based QA. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents.
arXiv Detail & Related papers (2025-07-07T17:59:54Z) - Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation [39.69790911626182]
The incorporation of memory into agents is essential for numerous tasks within the domain of Reinforcement Learning (RL). The term "memory" encompasses a wide range of concepts, which, coupled with the lack of a unified methodology for validating an agent's memory, leads to erroneous judgments about agents' memory capabilities. This paper aims to streamline the concept of memory in RL by providing practical, precise definitions of agent memory types.
arXiv Detail & Related papers (2024-12-09T14:34:31Z) - Evaluating Long-Term Memory in 3D Mazes [10.224858246626171]
Memory Maze is a 3D domain of randomized mazes designed for evaluating long-term memory in agents.
Unlike existing benchmarks, Memory Maze measures long-term memory separate from confounding agent abilities.
We find that current algorithms benefit from training with truncated backpropagation through time and succeed on small mazes, but fall short of human performance on the large mazes.
arXiv Detail & Related papers (2022-10-24T16:32:28Z)