RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
- URL: http://arxiv.org/abs/2601.06966v1
- Date: Sun, 11 Jan 2026 15:49:36 GMT
- Title: RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
- Authors: Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, Ronghao Chen,
- Abstract summary: We introduce **RealMem**, the first benchmark grounded in realistic project scenarios.<n>RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation.<n>We propose a pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory synthesis and Schedule Management to simulate the dynamic evolution of memory.
- Score: 21.670389104174536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture **"long-term project-oriented"** interactions where agents must track evolving goals. To bridge this gap, we introduce **RealMem**, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects. Our code and datasets are available at [https://github.com/AvatarMemory/RealMemBench](https://github.com/AvatarMemory/RealMemBench).
Related papers
- RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies [54.23445842621374]
Memory is critical for long-horizon and history-dependent robotic manipulation.<n>Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms.<n>We introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models.
arXiv Detail & Related papers (2026-03-04T21:59:32Z) - LifeBench: A Benchmark for Long-Horizon Multi-Source Memory [22.24847456134897]
We introduce Lifebench, which features densely connected, long-horizon event simulation.<n>Lifebench pushes AI agents beyond simple recall, requiring the integration of declarative and non-declarative memory reasoning.<n>Performance results show that top-tier, state-of-the-art memory systems reach just 55.2% accuracy.
arXiv Detail & Related papers (2026-03-04T06:42:17Z) - RMBench: Memory-Dependent Robotic Manipulation Benchmark with Insights into Policy Design [77.30163153176954]
RMBench is a simulation benchmark comprising 9 manipulation tasks that span multiple levels of memory complexity.<n>Mem-0 is a modular manipulation policy with explicit memory components designed to support controlled ablation studies.<n>We identify memory-related limitations in existing policies and provide empirical insights into how architectural design choices influence memory performance.
arXiv Detail & Related papers (2026-03-01T18:59:59Z) - EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models [16.865998112859604]
We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens.<n>EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs.
arXiv Detail & Related papers (2026-02-01T16:13:08Z) - MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization [57.17751568928966]
We propose MetaMem, a framework that augments memory systems with a self-evolving meta-memory.<n>During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks.<n>Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%.
arXiv Detail & Related papers (2026-01-27T04:46:23Z) - HiMem: Hierarchical Long-Term Memory for LLM Long-Horizon Agents [3.9396865837159822]
HiMem is a hierarchical long-term memory framework for long-horizon dialogues.<n>It supports memory construction, retrieval, and dynamic updating during sustained interactions.<n>Results show HiMem consistently outperforms representative baselines in accuracy, consistency, and long-term reasoning.
arXiv Detail & Related papers (2026-01-10T01:26:01Z) - EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory [63.84216832544323]
EvolMem is a new benchmark for assessing multi-session memory capabilities of large language models (LLMs) and agent systems.<n>To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations.<n>Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions.
arXiv Detail & Related papers (2026-01-07T03:14:42Z) - Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents [76.76004970226485]
Long-term memory is a critical capability for multimodal large language model (MLLM) agents.<n>Mem-Gallery is a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents.
arXiv Detail & Related papers (2026-01-07T02:03:13Z) - Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory [89.65731902036669]
Evo-Memory is a streaming benchmark and framework for evaluating self-evolving memory in large language model (LLM) agents.<n>We evaluate over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets.
arXiv Detail & Related papers (2025-11-25T21:08:07Z) - MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments [6.12783571098263]
MEMTRACK is a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments.<n>Each benchmark instance provides a chronologically platform-interleaved timeline, with noisy, conflicting, cross-referring information.<n>Our benchmark tests memory capabilities such as acquistion, selection and conflict resolution.
arXiv Detail & Related papers (2025-10-01T18:34:03Z) - Meta-Memory: Retrieving and Integrating Semantic-Spatial Memories for Robot Spatial Reasoning [5.740131013400576]
We propose Meta-Memory, a large language model (LLM)-driven agent that constructs a high-density memory representation of the environment.<n>The key innovation of Meta-Memory lies in its capacity to retrieve and integrate relevant memories through joint reasoning over semantic and spatial modalities.<n> Experimental results show that Meta-Memory significantly outperforms state-of-the-art methods on both the SpaceLocQA and the public NaVQA benchmarks.
arXiv Detail & Related papers (2025-09-25T05:22:52Z) - FindingDory: A Benchmark to Evaluate Memory in Embodied Agents [49.18498389833308]
We introduce a new benchmark for long-range embodied tasks in the Habitat simulator.<n>This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness.
arXiv Detail & Related papers (2025-06-18T17:06:28Z) - Interpersonal Memory Matters: A New Task for Proactive Dialogue Utilizing Conversational History [13.389395397698035]
We introduce a novel task named Memory-aware Proactive Dialogue (MapDia)<n>By the task, we then propose an automatic data construction method and create the first Chinese Memory-aware Proactive dataset (ChMapData)<n> Furthermore, we introduce a joint framework based on Retrieval Augmented Generation (RAG), featuring three modules: Topic Summarization, Topic Retrieval, and Proactive Topic-shifting Detection and Generation.
arXiv Detail & Related papers (2025-03-07T05:19:17Z) - Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent)<n>It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation.<n>The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.