AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
- URL: http://arxiv.org/abs/2603.01966v1
- Date: Mon, 02 Mar 2026 15:15:11 GMT
- Title: AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
- Authors: Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao, Yangqiu Song, Xunliang Cai
- Abstract summary: AMemGym is an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. Our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.
- Score: 61.6579785305668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining structured state consistency. Comprehensive metrics based on structured data guide both assessment and optimization of assistants. Extensive experiments reveal performance gaps in existing memory systems (e.g., RAG, long-context LLMs, and agentic memory) and corresponding reasons. AMemGym not only enables effective selection among competing approaches but also can potentially drive the self-evolution of memory management strategies. By bridging structured state evolution with free-form interactions, our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.
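The abstract's idea of predefining state evolution trajectories and state-dependent questions can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not AMemGym's actual API or metric: the names `StateQuestion`, `sample_trajectory`, and `memory_score`, the toy schema, and the exact-match scoring rule are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StateQuestion:
    """Hypothetical question whose correct answer depends on one latent state variable."""
    text: str
    state_key: str
    answers: dict  # maps a state value to the expected answer


def sample_trajectory(schema, n_periods, rng):
    """Sample an initial state, then resample one variable per later period."""
    state = {key: rng.choice(values) for key, values in schema.items()}
    trajectory = [dict(state)]
    for _ in range(n_periods - 1):
        key = rng.choice(sorted(schema))
        state[key] = rng.choice(schema[key])
        trajectory.append(dict(state))
    return trajectory


def memory_score(trajectory, questions, answer_fn):
    """Fraction of (period, question) pairs answered consistently with the current state."""
    hits = 0
    total = 0
    for period, state in enumerate(trajectory):
        for question in questions:
            expected = question.answers[state[question.state_key]]
            hits += int(answer_fn(period, question) == expected)
            total += 1
    return hits / total


# Toy schema and questions (illustrative values only).
schema = {"diet": ["vegan", "omnivore"], "city": ["Paris", "Tokyo"]}
questions = [
    StateQuestion("Suggest a dinner.", "diet",
                  {"vegan": "lentil curry", "omnivore": "roast chicken"}),
    StateQuestion("Suggest a weekend walk.", "city",
                  {"Paris": "Canal Saint-Martin", "Tokyo": "Yoyogi Park"}),
]
trajectory = sample_trajectory(schema, 4, random.Random(0))


def oracle(period, question):
    """Stand-in assistant that always tracks the current state."""
    return question.answers[trajectory[period][question.state_key]]


def stale(period, question):
    """Stand-in assistant that never updates its memory after the first period."""
    return question.answers[trajectory[0][question.state_key]]


print(memory_score(trajectory, questions, oracle))  # 1.0 by construction
print(memory_score(trajectory, questions, stale))
```

Because the ground-truth state is structured, scoring reduces to a lookup rather than free-form judging; in this toy setup, an assistant that fails to track updates can only match or fall below the oracle's score.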
Related papers
- MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization [57.17751568928966]
We propose MetaMem, a framework that augments memory systems with a self-evolving meta-memory. During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks. Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%.
arXiv Detail & Related papers (2026-01-27T04:46:23Z)
- The AI Hippocampus: How Far are We From Human Memory? [77.04745635827278]
Implicit memory refers to the knowledge embedded within the internal parameters of pre-trained transformers. Explicit memory involves external storage and retrieval components designed to augment model outputs with dynamic, queryable knowledge representations. Agentic memory introduces persistent, temporally extended memory structures within autonomous agents.
arXiv Detail & Related papers (2026-01-14T03:24:08Z)
- EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory [63.84216832544323]
EvolMem is a new benchmark for assessing multi-session memory capabilities of large language models (LLMs) and agent systems. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions.
arXiv Detail & Related papers (2026-01-07T03:14:42Z)
- Memoria: A Scalable Agentic Memory Framework for Personalized Conversational AI [0.6840655769002751]
Agentic memory is emerging as a key enabler for large language models (LLMs). We present Memoria, a modular memory framework that augments LLM-based conversational systems with persistent, interpretable, and context-rich memory. We demonstrate how Memoria enables scalable, personalized conversational artificial intelligence (AI) by bridging the gap between stateless LLM interfaces and agentic memory systems.
arXiv Detail & Related papers (2025-12-14T13:38:06Z)
- Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory [89.65731902036669]
Evo-Memory is a streaming benchmark and framework for evaluating self-evolving memory in large language model (LLM) agents. We evaluate over ten representative memory modules across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets.
arXiv Detail & Related papers (2025-11-25T21:08:07Z)
- MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments [6.12783571098263]
MEMTRACK is a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. Each benchmark instance provides a chronologically platform-interleaved timeline, with noisy, conflicting, cross-referring information. Our benchmark tests memory capabilities such as acquisition, selection, and conflict resolution.
arXiv Detail & Related papers (2025-10-01T18:34:03Z)
- MemOS: A Memory OS for AI System [116.87568350346537]
Large Language Models (LLMs) have become essential infrastructure for Artificial General Intelligence (AGI). Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods. MemOS is a memory operating system that treats memory as a manageable system resource.
arXiv Detail & Related papers (2025-07-04T17:21:46Z)
- MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents [26.647812147336538]
We construct a more comprehensive dataset and benchmark to evaluate the memory capability of LLM-based agents. Our dataset incorporates factual memory and reflective memory as different levels, and proposes participation and observation as various interactive scenarios. Based on our dataset, we present a benchmark, named MemBench, to evaluate the memory capability of LLM-based agents from multiple aspects, including their effectiveness, efficiency, and capacity.
arXiv Detail & Related papers (2025-06-20T10:09:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.