Related papers: According to Me: Long-Term Personalized Referential Memory QA

URL: http://arxiv.org/abs/2603.01990v1
Date: Mon, 02 Mar 2026 15:42:29 GMT
Title: According to Me: Long-Term Personalized Referential Memory QA
Authors: Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li, Bill Byrne
Abstract summary: ATM-Bench is the first benchmark for multimodal, multi-source personalized referential Memory QA. Schema-Guided Memory (SGM) structurally represents memory items originating from different sources. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set.
Score: 27.402232752643275
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential Memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, reasoning over multiple pieces of evidence from multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement 5 state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over the Descriptive Memory commonly adopted in prior work. Code available at: https://github.com/JingbiaoMei/ATM-Bench

Related papers

LifeBench: A Benchmark for Long-Horizon Multi-Source Memory [22.24847456134897]
We introduce LifeBench, which features densely connected, long-horizon event simulation. LifeBench pushes AI agents beyond simple recall, requiring the integration of declarative and non-declarative memory reasoning. Performance results show that top-tier, state-of-the-art memory systems reach just 55.2% accuracy.
arXiv Detail & Related papers (2026-03-04T06:42:17Z)
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks [55.145729491377374]
Existing evaluations of agents with memory typically assess memorization and action in isolation. We introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning.
arXiv Detail & Related papers (2026-02-18T09:49:14Z)
EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models [16.865998112859604]
We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs.
arXiv Detail & Related papers (2026-02-01T16:13:08Z)
OP-Bench: Benchmarking Over-Personalization for Memory-Augmented Personalized Conversational Agents [55.27061195244624]
We formalize over-personalization into three types: Irrelevance, Repetition, and Sycophancy. Agents tend to retrieve and over-attend to user memories even when unnecessary. Our work takes an initial step toward more controllable and appropriate personalization in memory-augmented dialogue systems.
arXiv Detail & Related papers (2026-01-20T08:27:13Z)
EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory [63.84216832544323]
EvolMem is a new benchmark for assessing multi-session memory capabilities of large language models (LLMs) and agent systems. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions.
arXiv Detail & Related papers (2026-01-07T03:14:42Z)
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents [76.76004970226485]
Long-term memory is a critical capability for multimodal large language model (MLLM) agents. Mem-Gallery is a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents.
arXiv Detail & Related papers (2026-01-07T02:03:13Z)
Evaluating Long-Term Memory for Long-Context Question Answering [100.1267054069757]
We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy.
arXiv Detail & Related papers (2025-10-27T18:03:50Z)
Multiple Memory Systems for Enhancing the Long-term Memory of Agent [9.43633399280987]
Existing methods, such as MemoryBank and A-MEM, store memory content of poor quality. We have designed a multiple memory system inspired by cognitive psychology theory.
arXiv Detail & Related papers (2025-08-21T06:29:42Z)
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions [22.190297901876278]
We identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long-context settings like book-based QA. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents.
arXiv Detail & Related papers (2025-07-07T17:59:54Z)
From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents [79.87304940020256]
Large Language Models (LLMs) have been widely adopted in conversational agents. MemGAS is a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question-answering and retrieval tasks.
arXiv Detail & Related papers (2025-05-26T06:13:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.