Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents
- URL: http://arxiv.org/abs/2602.10715v1
- Date: Wed, 11 Feb 2026 10:22:35 GMT
- Title: Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents
- Authors: Yifei Li, Weidong Guo, Lingling Zhang, Rongman Xu, Muye Huang, Hui Liu, Lijiao Xu, Yu Xu, Jun Liu
- Abstract summary: We introduce LoCoMo-Plus, a benchmark for assessing cognitive memory under cue-trigger semantic disconnect. We show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging.
- Score: 19.76627324918285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce LoCoMo-Plus, a benchmark for assessing cognitive memory under cue-trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveal failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: https://github.com/xjtuleeyf/Locomo-Plus.
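To make the abstract's contrast concrete: a string-matching metric checks whether a gold string appears in the response, while a constraint-consistency check asks whether the response honors a latent constraint established earlier in the conversation. The sketch below illustrates only this distinction; `call_judge` is a hypothetical stand-in for an LLM-judge call, the example data is invented, and none of this is the LoCoMo-Plus implementation (see the repository linked above).

```python
# Minimal sketch contrasting string matching with a constraint-consistency
# check. `call_judge` is a hypothetical LLM-judge callable (assumption);
# this is NOT the LoCoMo-Plus implementation.

def string_match_score(response: str, gold_answer: str) -> float:
    """Surface-level recall: does the gold string appear verbatim?"""
    return 1.0 if gold_answer.lower() in response.lower() else 0.0

def constraint_consistency_score(response: str, latent_constraint: str,
                                 call_judge) -> float:
    """Ask a judge model whether the response honors a constraint that was
    established earlier in the conversation but is not restated in the
    trigger query (e.g. 'the user is vegetarian')."""
    prompt = (
        f"Constraint established earlier in the dialogue: {latent_constraint}\n"
        f"Assistant response to a later query: {response}\n"
        "Does the response remain consistent with the constraint? "
        "Answer CONSISTENT or INCONSISTENT."
    )
    return 0.0 if "INCONSISTENT" in call_judge(prompt).upper() else 1.0

# Invented example: the trigger query ("suggest a dinner recipe") never
# restates the constraint, so no gold string can capture it.
constraint = "In session 2 the user said they are vegetarian."
response = "For dinner tonight, try a classic beef bourguignon."
print(string_match_score(response, "vegetarian"))  # 0.0: overlap carries no signal
# constraint_consistency_score(response, constraint, call_judge) -> expected 0.0
```

The point of the sketch is only that surface overlap has no way to register a violated latent constraint that the trigger query never restates, which is the misalignment the abstract describes.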
Related papers
- AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations [61.6579785305668]
AMemGym is an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. Our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.
arXiv Detail & Related papers (2026-03-02T15:15:11Z)
- RMBench: Memory-Dependent Robotic Manipulation Benchmark with Insights into Policy Design [77.30163153176954]
RMBench is a simulation benchmark comprising 9 manipulation tasks that span multiple levels of memory complexity. Mem-0 is a modular manipulation policy with explicit memory components designed to support controlled ablation studies. We identify memory-related limitations in existing policies and provide empirical insights into how architectural design choices influence memory performance.
arXiv Detail & Related papers (2026-03-01T18:59:59Z)
- MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks [55.145729491377374]
Existing evaluations of agents with memory typically assess memorization and action in isolation. We introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning.
arXiv Detail & Related papers (2026-02-18T09:49:14Z)
- EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models [16.865998112859604]
We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs.
arXiv Detail & Related papers (2026-02-01T16:13:08Z)
- EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory [63.84216832544323]
EvolMem is a new benchmark for assessing multi-session memory capabilities of large language models (LLMs) and agent systems. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions.
arXiv Detail & Related papers (2026-01-07T03:14:42Z)
- Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents [76.76004970226485]
Long-term memory is a critical capability for multimodal large language model (MLLM) agents. Mem-Gallery is a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents.
arXiv Detail & Related papers (2026-01-07T02:03:13Z)
- Evaluating Long-Term Memory for Long-Context Question Answering [100.1267054069757]
We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy.
arXiv Detail & Related papers (2025-10-27T18:03:50Z)
- MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments [6.12783571098263]
MEMTRACK is a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. Each benchmark instance provides a chronologically platform-interleaved timeline with noisy, conflicting, cross-referring information. Our benchmark tests memory capabilities such as acquisition, selection, and conflict resolution.
arXiv Detail & Related papers (2025-10-01T18:34:03Z)
- KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues [58.305425399644086]
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. We introduce KnowMT-Bench, the first-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields.
arXiv Detail & Related papers (2025-09-26T04:32:29Z)
- MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation [15.64077949677469]
We present a novel Memory-Augmented Dialogue Benchmark (MADail-Bench) to evaluate the effectiveness of memory-augmented dialogue systems (MADS).
The benchmark assesses two tasks separately, memory retrieval and memory recognition, incorporating both passive and proactive memory recall data.
Results from cutting-edge embedding models and large language models on this benchmark indicate the potential for further advancement.
arXiv Detail & Related papers (2024-09-23T17:38:41Z)
- SCM: Enhancing Large Language Model with Self-Controlled Memory Framework [54.33686574304374]
Large Language Models (LLMs) are constrained by their inability to process lengthy inputs, resulting in the loss of critical historical information. We propose the Self-Controlled Memory (SCM) framework to enhance the ability of LLMs to maintain long-term memory and recall relevant information.
arXiv Detail & Related papers (2023-04-26T07:25:31Z)
- SWING: Balancing Coverage and Faithfulness for Dialogue Summarization [67.76393867114923]
We propose to utilize natural language inference (NLI) models to improve coverage while avoiding factual inconsistencies.
We use NLI to compute fine-grained training signals that encourage the model to generate content from the reference summaries that has not yet been covered (a minimal sketch of this idea appears after this list).
Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach.
arXiv Detail & Related papers (2023-01-25T09:33:11Z)
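As referenced in the SWING entry above, the NLI-based coverage idea can be sketched in a few lines: a reference-summary sentence counts as covered when the generated summary entails it, and the uncovered remainder becomes a training signal. The sketch below is an illustration under that assumption, with `nli_entailment_prob` a hypothetical stand-in for any NLI model returning an entailment probability; it is not the SWING implementation.

```python
# Sketch of NLI-based coverage and faithfulness signals for dialogue
# summarization, in the spirit of SWING. `nli_entailment_prob` is a
# hypothetical stand-in for any NLI scorer P(premise entails hypothesis).

from typing import Callable, List

def uncovered_reference_sentences(
    generated_summary: str,
    reference_sentences: List[str],
    nli_entailment_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[str]:
    """A reference sentence counts as covered when the generated summary
    entails it; the remainder can be fed back as a fine-grained training
    signal that pushes the model toward the missing content."""
    return [
        sent for sent in reference_sentences
        if nli_entailment_prob(generated_summary, sent) < threshold
    ]

def unfaithful_summary_sentences(
    dialogue: str,
    summary_sentences: List[str],
    nli_entailment_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Faithfulness runs in the opposite direction: a summary sentence that
    the source dialogue does not entail is a candidate hallucination."""
    return [
        sent for sent in summary_sentences
        if nli_entailment_prob(dialogue, sent) < threshold
    ]
```

Using entailment in both directions is what lets a single NLI model balance the two objectives named in the title: coverage (reference content the summary misses) and faithfulness (summary content the dialogue does not support).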
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.