Related papers: RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

URL: http://arxiv.org/abs/2601.06966v1
Date: Sun, 11 Jan 2026 15:49:36 GMT
Title: RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
Authors: Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, Ronghao Chen,
Abstract summary: We introduce **RealMem**, the first benchmark grounded in realistic project scenarios.<n>RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation.<n>We propose a pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory synthesis and Schedule Management to simulate the dynamic evolution of memory.
Score: 21.670389104174536
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture **"long-term project-oriented"** interactions where agents must track evolving goals. To bridge this gap, we introduce **RealMem**, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects. Our code and datasets are available at [https://github.com/AvatarMemory/RealMemBench](https://github.com/AvatarMemory/RealMemBench).

Related papers

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies [54.23445842621374]
Memory is critical for long-horizon and history-dependent robotic manipulation.<n>Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms.<n>We introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models.
arXiv Detail & Related papers (2026-03-04T21:59:32Z)
LifeBench: A Benchmark for Long-Horizon Multi-Source Memory [22.24847456134897]
We introduce Lifebench, which features densely connected, long-horizon event simulation.<n>Lifebench pushes AI agents beyond simple recall, requiring the integration of declarative and non-declarative memory reasoning.<n>Performance results show that top-tier, state-of-the-art memory systems reach just 55.2% accuracy.
arXiv Detail & Related papers (2026-03-04T06:42:17Z)
RMBench: Memory-Dependent Robotic Manipulation Benchmark with Insights into Policy Design [77.30163153176954]
RMBench is a simulation benchmark comprising 9 manipulation tasks that span multiple levels of memory complexity.<n>Mem-0 is a modular manipulation policy with explicit memory components designed to support controlled ablation studies.<n>We identify memory-related limitations in existing policies and provide empirical insights into how architectural design choices influence memory performance.
arXiv Detail & Related papers (2026-03-01T18:59:59Z)
EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models [16.865998112859604]
We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens.<n>EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs.
arXiv Detail & Related papers (2026-02-01T16:13:08Z)
MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization [57.17751568928966]
We propose MetaMem, a framework that augments memory systems with a self-evolving meta-memory.<n>During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks.<n>Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%.
arXiv Detail & Related papers (2026-01-27T04:46:23Z)
HiMem: Hierarchical Long-Term Memory for LLM Long-Horizon Agents [3.9396865837159822]
HiMem is a hierarchical long-term memory framework for long-horizon dialogues.<n>It supports memory construction, retrieval, and dynamic updating during sustained interactions.<n>Results show HiMem consistently outperforms representative baselines in accuracy, consistency, and long-term reasoning.
arXiv Detail & Related papers (2026-01-10T01:26:01Z)
EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory [63.84216832544323]
EvolMem is a new benchmark for assessing multi-session memory capabilities of large language models (LLMs) and agent systems.<n>To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations.<n>Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions.
arXiv Detail & Related papers (2026-01-07T03:14:42Z)
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents [76.76004970226485]
Long-term memory is a critical capability for multimodal large language model (MLLM) agents.<n>Mem-Gallery is a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents.
arXiv Detail & Related papers (2026-01-07T02:03:13Z)
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory [89.65731902036669]
Evo-Memory is a streaming benchmark and framework for evaluating self-evolving memory in large language model (LLM) agents.<n>We evaluate over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets.
arXiv Detail & Related papers (2025-11-25T21:08:07Z)
MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments [6.12783571098263]
MEMTRACK is a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments.<n>Each benchmark instance provides a chronologically platform-interleaved timeline, with noisy, conflicting, cross-referring information.<n>Our benchmark tests memory capabilities such as acquistion, selection and conflict resolution.
arXiv Detail & Related papers (2025-10-01T18:34:03Z)
Meta-Memory: Retrieving and Integrating Semantic-Spatial Memories for Robot Spatial Reasoning [5.740131013400576]
We propose Meta-Memory, a large language model (LLM)-driven agent that constructs a high-density memory representation of the environment.<n>The key innovation of Meta-Memory lies in its capacity to retrieve and integrate relevant memories through joint reasoning over semantic and spatial modalities.<n> Experimental results show that Meta-Memory significantly outperforms state-of-the-art methods on both the SpaceLocQA and the public NaVQA benchmarks.
arXiv Detail & Related papers (2025-09-25T05:22:52Z)
FindingDory: A Benchmark to Evaluate Memory in Embodied Agents [49.18498389833308]
We introduce a new benchmark for long-range embodied tasks in the Habitat simulator.<n>This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness.
arXiv Detail & Related papers (2025-06-18T17:06:28Z)
Interpersonal Memory Matters: A New Task for Proactive Dialogue Utilizing Conversational History [13.389395397698035]
We introduce a novel task named Memory-aware Proactive Dialogue (MapDia)<n>By the task, we then propose an automatic data construction method and create the first Chinese Memory-aware Proactive dataset (ChMapData)<n> Furthermore, we introduce a joint framework based on Retrieval Augmented Generation (RAG), featuring three modules: Topic Summarization, Topic Retrieval, and Proactive Topic-shifting Detection and Generation.
arXiv Detail & Related papers (2025-03-07T05:19:17Z)
Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent)<n>It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation.<n>The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.