LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
- URL: http://arxiv.org/abs/2603.03781v1
- Date: Wed, 04 Mar 2026 06:42:17 GMT
- Title: LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
- Authors: Zihao Cheng, Weixin Wang, Yu Zhao, Ziyang Ren, Jiaxuan Chen, Ruiyang Xu, Shuai Huang, Yang Chen, Guowei Li, Mengshi Wang, Yi Xie, Ren Zhu, Zeren Jiang, Keda Lu, Yihong Li, Xiaoliang Wang, Liwei Liu, Cam-Tu Nguyen,
- Abstract summary: We introduce LifeBench, which features densely connected, long-horizon event simulation. LifeBench pushes AI agents beyond simple recall, requiring the integration of declarative and non-declarative memory reasoning. Performance results show that top-tier, state-of-the-art memory systems reach just 55.2% accuracy.
- Score: 22.24847456134897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-term memory is fundamental for personalized agents capable of accumulating knowledge, reasoning over user experiences, and adapting across time. However, existing memory benchmarks primarily target declarative memory, specifically semantic and episodic types, where all information is explicitly presented in dialogues. In contrast, real-world actions are also governed by non-declarative memory, including habitual and procedural types, which must be inferred from diverse digital traces. To bridge this gap, we introduce LifeBench, which features densely connected, long-horizon event simulation. It pushes AI agents beyond simple recall, requiring the integration of declarative and non-declarative memory reasoning across diverse and temporally extended contexts. Building such a benchmark presents two key challenges: ensuring data quality and scalability. We maintain data quality by employing real-world priors, including anonymized social surveys, map APIs, and holiday-integrated calendars, thus enforcing fidelity, diversity, and behavioral rationality within the dataset. For scalability, we draw inspiration from cognitive science and structure events according to their partonomic hierarchy, enabling efficient parallel generation while maintaining global coherence. Performance results show that top-tier, state-of-the-art memory systems reach just 55.2% accuracy, highlighting the inherent difficulty of long-horizon retrieval and multi-source integration within our proposed benchmark. The dataset and data synthesis code are available at https://github.com/1754955896/LifeBench.
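The partonomic (part-whole) structuring the abstract describes can be illustrated with a minimal sketch: a top-level event constrains a time window, and sub-events are expanded within that window, so sibling subtrees could be generated independently (and hence in parallel) while the global timeline stays coherent. All names and the even-split policy below are illustrative assumptions, not taken from the LifeBench codebase.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    """One node in a partonomic event hierarchy (whole -> parts)."""
    name: str
    start_hour: int  # time window inherited from the parent event
    end_hour: int
    children: List["Event"] = field(default_factory=list)

def expand(parent: Event, part_names: List[str]) -> Event:
    """Split the parent's time window evenly among named sub-events.

    Each sub-event depends only on its parent's constraints, so
    sibling subtrees can be expanded by independent workers while
    still respecting the parent's overall window.
    """
    span = (parent.end_hour - parent.start_hour) // len(part_names)
    t = parent.start_hour
    for name in part_names:
        parent.children.append(Event(name, t, t + span))
        t += span
    return parent

# Expand a hypothetical "workday" event into four parts.
day = Event("workday", 9, 17)
expand(day, ["standup", "coding", "lunch", "review"])
for child in day.children:
    print(child.name, child.start_hour, child.end_hour)
```

In this sketch the parent's window bounds every child, which is what keeps parallel generation globally coherent: no sub-event can drift outside the slot its parent allocated.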
Related papers
- RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies [54.23445842621374]
Memory is critical for long-horizon and history-dependent robotic manipulation. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms. We introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models.
arXiv Detail & Related papers (2026-03-04T21:59:32Z) - Graph-based Agent Memory: Taxonomy, Techniques, and Applications [63.70340159016138]
Memory emerges as the core module in Large Language Model (LLM)-based agents for long-horizon complex tasks. Among diverse paradigms, graphs stand out as a powerful structure for agent memory due to their intrinsic ability to model relational dependencies. This survey presents a comprehensive review of agent memory from the graph-based perspective.
arXiv Detail & Related papers (2026-02-05T13:49:05Z) - EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models [16.865998112859604]
We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs.
arXiv Detail & Related papers (2026-02-01T16:13:08Z) - Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey [211.01908189012184]
Memory, with hundreds of papers released this year, emerges as the critical solution to fill the utility gap. We provide a unified view of foundation agent memory along three dimensions. We then analyze how memory is instantiated and operated under different agent topologies.
arXiv Detail & Related papers (2026-01-14T07:38:38Z) - The AI Hippocampus: How Far are We From Human Memory? [77.04745635827278]
Implicit memory refers to the knowledge embedded within the internal parameters of pre-trained transformers. Explicit memory involves external storage and retrieval components designed to augment model outputs with dynamic, queryable knowledge representations. Agentic memory introduces persistent, temporally extended memory structures within autonomous agents.
arXiv Detail & Related papers (2026-01-14T03:24:08Z) - RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction [21.670389104174536]
We introduce RealMem, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory Synthesis and Schedule Management to simulate the dynamic evolution of memory.
arXiv Detail & Related papers (2026-01-11T15:49:36Z) - HiMem: Hierarchical Long-Term Memory for LLM Long-Horizon Agents [3.9396865837159822]
HiMem is a hierarchical long-term memory framework for long-horizon dialogues. It supports memory construction, retrieval, and dynamic updating during sustained interactions. Results show HiMem consistently outperforms representative baselines in accuracy, consistency, and long-term reasoning.
arXiv Detail & Related papers (2026-01-10T01:26:01Z) - Memory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning [55.251697395358285]
Large language models (LLMs) are increasingly deployed as intelligent agents that reason, plan, and interact with their environments. To effectively scale to long-horizon scenarios, a key capability for such agents is a memory mechanism that can retain, organize, and retrieve past experiences. We propose CompassMem, an event-centric memory framework inspired by Event Theory.
arXiv Detail & Related papers (2026-01-08T08:44:07Z) - EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory [63.84216832544323]
EvolMem is a new benchmark for assessing multi-session memory capabilities of large language models (LLMs) and agent systems. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions.
arXiv Detail & Related papers (2026-01-07T03:14:42Z) - FindingDory: A Benchmark to Evaluate Memory in Embodied Agents [49.18498389833308]
We introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness.
arXiv Detail & Related papers (2025-06-18T17:06:28Z)