ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support
- URL: http://arxiv.org/abs/2602.01885v1
- Date: Mon, 02 Feb 2026 09:58:26 GMT
- Title: ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support
- Authors: Tiantian Chen, Jiaqi Lu, Ying Shen, Lin Zhang
- Abstract summary: Large Language Models (LLMs) have shown strong potential as conversational agents. Yet, their effectiveness remains limited by deficiencies in robust long-term memory. ES-MemEval is a benchmark that systematically evaluates five core memory capabilities. EvoEmo is a dataset for personalized long-term emotional support.
- Score: 11.480342895892404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown strong potential as conversational agents. Yet, their effectiveness remains limited by deficiencies in robust long-term memory, particularly in complex, long-term web-based services such as online emotional support. However, existing long-term dialogue benchmarks primarily focus on static and explicit fact retrieval, failing to evaluate agents in critical scenarios where user information is dispersed, implicit, and continuously evolving. To address this gap, we introduce ES-MemEval, a comprehensive benchmark that systematically evaluates five core memory capabilities (information extraction, temporal reasoning, conflict detection, abstention, and user modeling) in long-term emotional support settings, covering question answering, summarization, and dialogue generation tasks. To support the benchmark, we also propose EvoEmo, a multi-session dataset for personalized long-term emotional support that captures fragmented, implicit user disclosures and evolving user states. Extensive experiments on open-source long-context, commercial, and retrieval-augmented (RAG) LLMs show that explicit long-term memory is essential for reducing hallucinations and enabling effective personalization. At the same time, RAG improves factual consistency but struggles with temporal dynamics and evolving user states. These findings highlight both the potential and limitations of current paradigms and motivate more robust integration of memory and retrieval for long-term personalized dialogue systems.
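The abstract does not specify how the five capabilities are scored; the sketch below is purely illustrative and assumes a simple per-capability QA harness in which abstention items reward declining to answer (all function names, field names, and the answer-matching rule are assumptions, not details from the paper):

```python
# Hypothetical sketch of an ES-MemEval-style evaluation loop. The capability
# names come from the abstract; the data layout and scoring rules are assumed.

CAPABILITIES = {"extraction", "temporal", "conflict", "abstention", "user_modeling"}

def score_item(item, prediction):
    """Score one QA item; abstention items reward refusing to answer."""
    if item["capability"] == "abstention":
        return 1.0 if prediction.strip().lower() in {"unknown", "i don't know"} else 0.0
    return 1.0 if prediction.strip().lower() == item["answer"].lower() else 0.0

def evaluate(items, agent):
    """Accumulate per-capability accuracy over multi-session QA items."""
    per_cap = {c: [] for c in CAPABILITIES}
    for item in items:
        pred = agent(item["sessions"], item["question"])
        per_cap[item["capability"]].append(score_item(item, pred))
    return {c: sum(s) / len(s) for c, s in per_cap.items() if s}

# Toy usage with an agent that always abstains: it scores perfectly on the
# abstention item and fails the extraction item.
items = [
    {"capability": "abstention", "sessions": [], "question": "Q?", "answer": None},
    {"capability": "extraction", "sessions": [], "question": "Q?", "answer": "tea"},
]
results = evaluate(items, lambda sessions, q: "unknown")
```

The abstention category matters because, per the abstract, hallucination reduction is one of the main benefits explicit memory provides; a harness that never rewards "I don't know" cannot measure it.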
Related papers
- AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts [78.33143446024485]
We introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios.
arXiv Detail & Related papers (2026-01-28T16:05:44Z) - Evaluating Long-Context Reasoning in LLM-Based WebAgents [22.264781808930948]
This paper introduces a benchmark for evaluating the long-context reasoning capabilities of WebAgents. We observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50% in baseline conditions to less than 10% in long-context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives.
arXiv Detail & Related papers (2025-12-03T22:53:10Z) - O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents [60.1848551962911]
O-Mem is a novel memory framework based on active user profiling. It supports hierarchical retrieval of persona attributes and topic-related context.
arXiv Detail & Related papers (2025-11-17T16:55:19Z) - RGMem: Renormalization Group-based Memory Evolution for Language Agent User Profile [8.224917568034572]
We propose a self-evolving memory framework inspired by the classic renormalization group (RG) from physics. This framework organizes the dialogue history at multiple scales. The core innovation of our work lies in modeling memory evolution as a multi-scale process of information compression and emergence.
arXiv Detail & Related papers (2025-10-18T08:16:46Z) - In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents [70.12342024019044]
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information limits their effectiveness. We propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections. RMM shows a more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
arXiv Detail & Related papers (2025-03-11T04:15:52Z) - LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory [68.97819665784442]
We introduce LongMemEval, a benchmark designed to evaluate five core long-term memory abilities of chat assistants. LongMemEval presents a significant challenge to existing long-term memory systems. We present a unified framework that breaks down the long-term memory design into three stages: indexing, retrieval, and reading.
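The indexing / retrieval / reading decomposition mentioned above can be sketched minimally as follows; the keyword-overlap retriever, the (session, turn) memory layout, and the prompt format are illustrative assumptions, not the framework's actual implementation:

```python
# Minimal, assumed sketch of a three-stage long-term memory pipeline:
# indexing (build memory units), retrieval (rank them against a question),
# and reading (assemble evidence for a downstream reader/LLM).

def index_sessions(sessions):
    """Indexing: split each session into (session_id, turn) memory units."""
    return [(sid, turn) for sid, session in enumerate(sessions) for turn in session]

def retrieve(memory, question, k=1):
    """Retrieval: rank memory units by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(memory,
                    key=lambda m: len(q_words & set(m[1].lower().split())),
                    reverse=True)
    return scored[:k]

def read(retrieved, question):
    """Reading: concatenate the retrieved evidence into a reader prompt."""
    evidence = " ".join(turn for _, turn in retrieved)
    return f"Evidence: {evidence}\nQuestion: {question}"

# Toy usage: only the session mentioning the cat is retrieved.
sessions = [["I adopted a cat named Miso."], ["Work has been stressful lately."]]
memory = index_sessions(sessions)
question = "What cat did I adopt?"
prompt = read(retrieve(memory, question), question)
```

Separating the three stages makes each independently swappable, e.g. replacing the toy overlap retriever with a dense retriever while keeping indexing and reading unchanged.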
arXiv Detail & Related papers (2024-10-14T17:59:44Z) - Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent). It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation. The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z) - Evaluating Very Long-Term Conversational Memory of LLM Agents [95.84027826745609]
We introduce a machine-human pipeline to generate high-quality, very long-term dialogues.
We equip each agent with the capability of sharing and reacting to images.
The generated conversations are verified and edited by human annotators for long-range consistency.
arXiv Detail & Related papers (2024-02-27T18:42:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.