EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models
- URL: http://arxiv.org/abs/2602.01313v2
- Date: Tue, 03 Feb 2026 03:03:41 GMT
- Title: EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models
- Authors: Chuanrui Hu, Tong Li, Xingze Gao, Hongda Chen, Yi Bai, Dannong Xu, Tianwei Lin, Xinda Zhao, Xiaohong Li, Yunyun Han, Jian Pei, Yafeng Deng
- Abstract summary: We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs.
- Score: 16.865998112859604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26%; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.
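The retrieval bottleneck described in finding (3) concerns similarity-based memory lookup: queries and stored memories are embedded and ranked by vector similarity, which rewards surface overlap rather than implicit relevance. A minimal sketch of that retrieval pattern (a toy bag-of-words vector stands in for a real dense encoder; all function names and example memories are illustrative, not from the paper):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use dense encoders."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, memories: list[str], k: int = 2) -> list[str]:
    """Return the k memories most similar to the query."""
    q = embed(query)
    return sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

memories = [
    "Alice moved to Berlin in May",
    "Bob prefers tea over coffee",
    "Alice later moved from Berlin to Paris",
]
print(retrieve("where does Alice live", memories, k=2))
```

Note that the query matches both Alice memories only through the shared token "Alice"; nothing in the similarity score captures that the later move supersedes the earlier one, which is the kind of temporal/implicit-relevance gap the benchmark probes.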
Related papers
- According to Me: Long-Term Personalized Referential Memory QA [27.402232752643275]
ATM-Bench is the first benchmark for multimodal, multi-source personalized referential memory QA. Guided Memory (SGM) structurally represents memory items originating from different sources. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set.
arXiv Detail & Related papers (2026-03-02T15:42:29Z) - AMA: Adaptive Memory via Multi-Agent Collaboration [54.490349689939166]
We propose Adaptive Memory via Multi-Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA significantly outperforms state-of-the-art baselines while reducing token consumption by approximately 80% compared to full-context methods.
arXiv Detail & Related papers (2026-01-28T08:09:49Z) - EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory [63.84216832544323]
EvolMem is a new benchmark for assessing the multi-session memory capabilities of large language models (LLMs) and agent systems. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions.
arXiv Detail & Related papers (2026-01-07T03:14:42Z) - Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents [76.76004970226485]
Long-term memory is a critical capability for multimodal large language model (MLLM) agents. Mem-Gallery is a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents.
arXiv Detail & Related papers (2026-01-07T02:03:13Z) - Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs [28.807582003957005]
We present a framework for evaluating the abilities of large language models (LLMs) on tasks that require long-term memory and thus long-context reasoning. We then construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. To enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems.
arXiv Detail & Related papers (2025-10-31T07:29:52Z) - Evaluating Long-Term Memory for Long-Context Question Answering [100.1267054069757]
We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy.
arXiv Detail & Related papers (2025-10-27T18:03:50Z) - From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents [79.87304940020256]
Large Language Models (LLMs) have been widely adopted in conversational agents. MemGAS is a framework that enhances memory consolidation by constructing multi-granularity associations with adaptive selection and retrieval. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answering and retrieval tasks.
arXiv Detail & Related papers (2025-05-26T06:13:07Z) - LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory [68.97819665784442]
We introduce LongMemEval, a benchmark designed to evaluate five core long-term memory abilities of chat assistants. LongMemEval presents a significant challenge to existing long-term memory systems. We present a unified framework that breaks down long-term memory design into three stages: indexing, retrieval, and reading.
arXiv Detail & Related papers (2024-10-14T17:59:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.