MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents
- URL: http://arxiv.org/abs/2506.21605v1
- Date: Fri, 20 Jun 2025 10:09:23 GMT
- Title: MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents
- Authors: Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, Zhenhua Dong
- Abstract summary: We construct a more comprehensive dataset and benchmark to evaluate the memory capability of LLM-based agents. Our dataset incorporates factual memory and reflective memory as different levels, and proposes participation and observation as distinct interactive scenarios. Based on our dataset, we present a benchmark, named MemBench, to evaluate the memory capability of LLM-based agents from multiple aspects, including their effectiveness, efficiency, and capacity.
- Score: 26.647812147336538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have highlighted the significance of memory mechanisms in LLM-based agents, which enable them to store observed information and adapt to dynamic environments. However, evaluating their memory capabilities remains challenging. Previous evaluations are commonly limited in the diversity of memory levels and interactive scenarios, and they lack comprehensive metrics that reflect memory capabilities from multiple aspects. To address these problems, in this paper we construct a more comprehensive dataset and benchmark to evaluate the memory capability of LLM-based agents. Our dataset incorporates factual memory and reflective memory as different levels, and proposes participation and observation as distinct interactive scenarios. Based on our dataset, we present a benchmark, named MemBench, to evaluate the memory capability of LLM-based agents from multiple aspects, including their effectiveness, efficiency, and capacity. To benefit the research community, we release our dataset and project at https://github.com/import-myself/Membench.
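As a rough illustration of what scoring an agent's memory along these three aspects could look like, here is a minimal sketch in Python. Everything in it (the `ToyMemoryAgent` class, the `evaluate` helper, the metric definitions) is hypothetical and far simpler than the actual MemBench harness, which lives in the repository linked above.

```python
import time

class ToyMemoryAgent:
    """A hypothetical agent with a flat key-value memory (not the MemBench reference agent)."""
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.store = {}

    def memorize(self, key, value):
        if len(self.store) >= self.capacity:
            self.store.pop(next(iter(self.store)))  # evict the oldest entry
        self.store[key] = value

    def answer(self, question):
        return self.store.get(question, "unknown")

def evaluate(agent, qa_pairs):
    """Score effectiveness, efficiency, and capacity in the spirit of MemBench's three aspects."""
    correct, elapsed = 0, 0.0
    for question, expected in qa_pairs:
        start = time.perf_counter()
        prediction = agent.answer(question)
        elapsed += time.perf_counter() - start
        correct += prediction == expected
    return {
        "effectiveness": correct / len(qa_pairs),       # fraction answered correctly
        "efficiency": elapsed / len(qa_pairs),          # mean retrieval latency (seconds)
        "capacity": len(agent.store) / agent.capacity,  # fraction of memory slots in use
    }

agent = ToyMemoryAgent(capacity=10)
agent.memorize("Where does Alice work?", "a hospital")
print(evaluate(agent, [("Where does Alice work?", "a hospital")]))
```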
Related papers
- Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents [19.04968632268433]
We propose a hierarchical memory architecture for Large Language Model Agents (LLM Agents).
Each memory vector is embedded with a positional index encoding pointing to its semantically related sub-memories in the next layer.
During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations.
arXiv Detail & Related papers (2025-07-23T12:45:44Z)
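A minimal sketch of the layer-by-layer routed retrieval idea described in the entry above, assuming each memory node carries explicit indices of its sub-memories in the next layer. The `MemoryNode` structure and the plain cosine scoring are stand-ins; the paper's positional index encoding is learned, not hand-built like this.

```python
import math

class MemoryNode:
    def __init__(self, text, embedding, children=None):
        self.text = text
        self.embedding = embedding      # vector for this memory
        self.children = children or []  # indices into the next layer (the "positional index")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def routed_retrieve(layers, query_vec):
    """Follow the best-matching node's child indices down the hierarchy,
    so only a small slice of each layer is scored (no exhaustive search)."""
    candidates = range(len(layers[0]))  # start with every node in the top layer
    path = []
    for depth, layer in enumerate(layers):
        best = max(candidates, key=lambda i: cosine(layer[i].embedding, query_vec))
        path.append(layer[best].text)
        if depth + 1 < len(layers):
            candidates = layer[best].children or range(len(layers[depth + 1]))
    return path

# Two-layer toy hierarchy: a coarse summary routes to fine-grained memories.
layers = [
    [MemoryNode("work topics", [1.0, 0.0], children=[0, 1])],
    [MemoryNode("met Bob at the office", [0.9, 0.1]),
     MemoryNode("project deadline is Friday", [0.8, 0.3])],
]
print(routed_retrieve(layers, [1.0, 0.2]))
```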
- Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions [19.51727855436013]
We refer to agents with memory mechanisms as memory agents.
In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution.
Existing datasets either rely on limited context lengths or are tailored for static, long-context settings like book-based QA.
No existing benchmark covers all four competencies; therefore, we introduce MemoryAgentBench, a new benchmark specifically designed for memory agents.
arXiv Detail & Related papers (2025-07-07T17:59:54Z)
- MemOS: A Memory OS for AI System [116.87568350346537]
Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI).
Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods.
MemOS is a memory operating system that treats memory as a manageable system resource.
arXiv Detail & Related papers (2025-07-04T17:21:46Z)
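The OS analogy above invites a resource-manager reading: memory blocks that are allocated, accessed through handles, and reclaimed under a budget. The sketch below is a guess at that shape, not MemOS's actual API; `MemoryBlock`, `MemoryManager`, and the LRU policy are all assumptions.

```python
from dataclasses import dataclass, field
import time

@dataclass
class MemoryBlock:
    """A unit of memory managed like a system resource (hypothetical, OS-analogy only)."""
    content: str
    owner: str
    created: float = field(default_factory=time.time)
    last_access: float = field(default_factory=time.time)

class MemoryManager:
    """Allocates, reads, and reclaims memory blocks, loosely mirroring an OS allocator."""
    def __init__(self, budget=4):
        self.budget = budget  # maximum resident blocks
        self.blocks = {}
        self.next_handle = 0

    def alloc(self, content, owner):
        if len(self.blocks) >= self.budget:
            # Evict the least recently used block, as a simple reclamation policy.
            lru = min(self.blocks, key=lambda h: self.blocks[h].last_access)
            del self.blocks[lru]
        handle = self.next_handle
        self.next_handle += 1
        self.blocks[handle] = MemoryBlock(content, owner)
        return handle

    def read(self, handle):
        block = self.blocks[handle]
        block.last_access = time.time()
        return block.content

mm = MemoryManager(budget=2)
h = mm.alloc("user prefers concise answers", owner="assistant")
print(mm.read(h))
```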
- How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior [49.62361184944454]
Memory is a critical component in large language model (LLM)-based agents.
We study how memory management choices impact LLM agents' behavior, especially their long-term performance.
arXiv Detail & Related papers (2025-05-21T22:35:01Z)
- Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions [55.19217798774033]
Memory is a fundamental component of AI systems, underpinning large language model (LLM)-based agents.
In this survey, we first categorize memory representations into parametric and contextual forms.
We then introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression.
arXiv Detail & Related papers (2025-05-01T17:31:33Z)
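To make the taxonomy concrete, one way the six operations could surface as a single interface over a contextual memory store is sketched below. Every method body is a naive placeholder chosen for brevity, not an implementation from the survey.

```python
class ContextualMemory:
    """Toy store exposing the survey's six operations; every body is a simplistic stand-in."""
    def __init__(self):
        self.entries = []  # list of (text, weight) pairs

    def consolidate(self, text):
        self.entries.append((text, 1.0))  # Consolidation: commit a new experience

    def update(self, old, new):
        # Updating: revise an existing memory in place
        self.entries = [(new if t == old else t, w) for t, w in self.entries]

    def index(self):
        return {t: i for i, (t, _) in enumerate(self.entries)}  # Indexing: build a lookup

    def forget(self, threshold=0.5):
        # Forgetting: drop entries whose weight has decayed below the threshold
        self.entries = [(t, w) for t, w in self.entries if w >= threshold]

    def retrieve(self, query):
        return [t for t, _ in self.entries if query.lower() in t.lower()]  # Retrieval

    def compress(self, max_len=40):
        self.entries = [(t[:max_len], w) for t, w in self.entries]  # Compression: truncate

memory = ContextualMemory()
memory.consolidate("User asked about the Paris trip itinerary")
print(memory.retrieve("paris"))
```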
- On the Structural Memory of LLM Agents [20.529239764968654]
Memory plays a pivotal role in enabling large language model (LLM)-based agents to engage in complex and long-term interactions.
This paper investigates how memory structures and memory retrieval methods affect the performance of LLM-based agents.
arXiv Detail & Related papers (2024-12-17T04:30:00Z)
- MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants [64.41695570145673]
We propose MemSim, a Bayesian simulator designed to automatically construct reliable questions and answers (QAs) from generated user messages.
Based on MemSim, we generate a dataset in the daily-life scenario, named MemDaily, and conduct extensive experiments to assess the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-30T10:19:04Z)
- A Survey on the Memory Mechanism of Large Language Model based Agents [66.4963345269611]
Large language model (LLM) based agents have recently attracted much attention from the research and industry communities.
LLM-based agents are distinguished by their self-evolving capability, which is the basis for solving real-world problems.
The key component to support agent-environment interactions is the memory of the agents.
arXiv Detail & Related papers (2024-04-21T01:49:46Z)
- PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Synthesis in Question Answering [27.815507347725344]
This research introduces PerLTQA, an innovative QA dataset that combines semantic and episodic memories.
PerLTQA features two types of memory and a benchmark of 8,593 questions for 30 characters.
We propose a novel framework for memory integration and generation, consisting of three main components: Memory Classification, Memory Retrieval, and Memory Synthesis.
arXiv Detail & Related papers (2024-02-26T04:09:53Z)
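The three components above read naturally as a pipeline. The sketch below wires classification, retrieval, and synthesis together with trivial keyword rules standing in for the learned models, so all three functions are hypothetical illustrations rather than PerLTQA's actual framework.

```python
def classify_memory(text):
    """Memory Classification: crude rule standing in for a learned classifier."""
    episodic_cues = ("yesterday", "met", "went")
    return "episodic" if any(w in text.lower() for w in episodic_cues) else "semantic"

def retrieve_memories(question, memories, kind):
    """Memory Retrieval: keyword overlap within the predicted memory type."""
    words = set(question.lower().split())
    return [m for m in memories
            if classify_memory(m) == kind and words & set(m.lower().split())]

def synthesize_answer(question, retrieved):
    """Memory Synthesis: concatenate evidence into an answer (an LLM would do this in practice)."""
    return f"Q: {question} | evidence: {'; '.join(retrieved) or 'none found'}"

memories = [
    "Alice went hiking yesterday",  # episodic
    "Alice is a cardiologist",      # semantic
]
question = "What did Alice do yesterday?"
kind = classify_memory(question)    # route the question to a memory type
print(synthesize_answer(question, retrieve_memories(question, memories, kind)))
```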
- RET-LLM: Towards a General Read-Write Memory for Large Language Models [53.288356721954514]
RET-LLM is a novel framework that equips large language models with a general write-read memory unit.
Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets.
Our framework exhibits robust performance in handling temporal-based question answering tasks.
arXiv Detail & Related papers (2023-05-23T17:53:38Z)
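A minimal sketch of a triplet-style write-read memory in the spirit described above: knowledge is stored as (subject, relation, object) triplets and read back by partial match. The extraction step, which RET-LLM performs with the LLM itself, is replaced here by hand-written triplets, and the `TripletMemory` class is an assumption.

```python
class TripletMemory:
    """Write-read memory storing knowledge as (subject, relation, object) triplets."""
    def __init__(self):
        self.triplets = []

    def write(self, subject, relation, obj):
        self.triplets.append((subject.lower(), relation.lower(), obj))

    def read(self, subject=None, relation=None):
        """Return objects whose triplet matches the given subject and/or relation."""
        return [o for s, r, o in self.triplets
                if (subject is None or s == subject.lower())
                and (relation is None or r == relation.lower())]

memory = TripletMemory()
# In RET-LLM the triplets are extracted from text by the LLM; here we write them by hand.
memory.write("Alice", "works_at", "the hospital")
memory.write("Alice", "works_at_since", "2021")  # temporal facts fit the same scheme
print(memory.read(subject="Alice", relation="works_at"))  # -> ['the hospital']
```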
- SCM: Enhancing Large Language Model with Self-Controlled Memory Framework [54.33686574304374]
Large Language Models (LLMs) are constrained by their inability to process lengthy inputs, resulting in the loss of critical historical information.
We propose the Self-Controlled Memory (SCM) framework to enhance the ability of LLMs to maintain long-term memory and recall relevant information.
arXiv Detail & Related papers (2023-04-26T07:25:31Z)
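One plausible reading of "self-controlled" is a per-turn decision about whether to consult memory at all. The sketch below captures that loop with a keyword heuristic where SCM would query the LLM itself; `needs_memory`, `respond`, and the toy `llm` callable are all assumptions, not the paper's components.

```python
def needs_memory(user_turn):
    """Stand-in for the self-control step: in the paper the LLM itself makes this call."""
    cues = ("earlier", "before", "last time", "remember", "again")
    return any(cue in user_turn.lower() for cue in cues)

def respond(user_turn, memory_stream, llm=lambda prompt: f"[LLM reply to: {prompt}]"):
    context = ""
    if needs_memory(user_turn):
        # Recall only when the controller says so, keeping the prompt short otherwise.
        relevant = [m for m in memory_stream if set(m.split()) & set(user_turn.split())]
        context = " | ".join(relevant[-3:])  # cap how much memory is injected
    reply = llm(f"{context} ### {user_turn}" if context else user_turn)
    memory_stream.append(user_turn)          # archive the turn for later recall
    return reply

stream = ["we discussed the Berlin flight on Monday"]
print(respond("What did we say about the Berlin flight earlier?", stream))
```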
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.