MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation
- URL: http://arxiv.org/abs/2409.15240v2
- Date: Wed, 23 Oct 2024 17:47:58 GMT
- Title: MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation
- Authors: Junqing He, Liang Zhu, Rui Wang, Xi Wang, Reza Haffari, Jiaxing Zhang,
- Abstract summary: We present a novel Memory-Augmented Dialogue Benchmark (MADail-Bench) to evaluate the effectiveness of memory-augmented dialogue systems (MADS)
The benchmark assesses two tasks separately: memory retrieval and memory recognition with the incorporation of both passive and proactive memory recall data.
Results from cutting-edge embedding models and large language models on this benchmark indicate the potential for further advancement.
- Score: 15.64077949677469
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-term memory is important for chatbots and dialogue systems (DS) to create consistent and human-like conversations, evidenced by numerous developed memory-augmented DS (MADS). To evaluate the effectiveness of such MADS, existing commonly used evaluation metrics, like retrieval accuracy and perplexity (PPL), mainly focus on query-oriented factualness and language quality assessment. However, these metrics often lack practical value. Moreover, the evaluation dimensions are insufficient for human-like assessment in DS. Regarding memory-recalling paradigms, current evaluation schemes only consider passive memory retrieval while ignoring diverse memory recall with rich triggering factors, e.g., emotions and surroundings, which can be essential in emotional support scenarios. To bridge the gap, we construct a novel Memory-Augmented Dialogue Benchmark (MADail-Bench) covering various memory-recalling paradigms based on cognitive science and psychology theories. The benchmark assesses two tasks separately: memory retrieval and memory recognition with the incorporation of both passive and proactive memory recall data. We introduce new scoring criteria to the evaluation, including memory injection, emotion support (ES) proficiency, and intimacy, to comprehensively assess generated responses. Results from cutting-edge embedding models and large language models on this benchmark indicate the potential for further advancement. Extensive testing further reveals correlations between memory injection, ES proficiency, and intimacy.
Related papers
- Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests [89.09172401497213]
We examine three evaluation paradigms: large question-answering benchmarks, interactive games, and cognitive tests.
We compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use.
Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models.
arXiv Detail & Related papers (2025-02-20T08:36:58Z) - Event Segmentation Applications in Large Language Model Enabled Automated Recall Assessments [0.0]
Event segmentation is central to how we perceive, encode, and recall experiences.
Current research methodologies rely heavily on human for assessing segmentation patterns and recall ability.
We leverage Large Language Models (LLMs) to automate event segmentation and assess recall.
arXiv Detail & Related papers (2025-02-19T00:48:51Z) - On Memory Construction and Retrieval for Personalized Conversational Agents [69.46887405020186]
We propose SeCom, a method that constructs a memory bank with topical segments by introducing a conversation model, while performing memory retrieval based on compressed memory units.
Experimental results show that SeCom outperforms turn-level, session-level, and several summarization-based methods on long-term conversation benchmarks such as LOCOMO and Long-MT-Bench+.
arXiv Detail & Related papers (2025-02-08T14:28:36Z) - Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation [39.69790911626182]
The incorporation of memory into agents is essential for numerous tasks within the domain of Reinforcement Learning (RL)
The term memory'' encompasses a wide range of concepts, which, coupled with the lack of a unified methodology for validating an agent's memory, leads to erroneous judgments about agents' memory capabilities.
This paper aims to streamline the concept of memory in RL by providing practical precise definitions of agent memory types.
arXiv Detail & Related papers (2024-12-09T14:34:31Z) - Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in reinforcement learning environments that are partially observable and long-term.
We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents.
Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
arXiv Detail & Related papers (2024-10-14T03:50:17Z) - Ever-Evolving Memory by Blending and Refining the Past [30.63352929849842]
CREEM is a novel memory system for long-term conversation.
It seamlessly connects past and present information, while also possessing the ability to forget obstructive information.
arXiv Detail & Related papers (2024-03-03T08:12:59Z) - A Framework for Inference Inspired by Human Memory Mechanisms [9.408704431898279]
We propose a PMI framework that consists of perception, memory and inference components.
The memory module comprises working and long-term memory, with the latter endowed with a higher-order structure to retain extensive and complex relational knowledge and experience.
We apply our PMI to improve prevailing Transformers and CNN models on question-answering tasks like bAbI-20k and Sort-of-CLEVR datasets.
arXiv Detail & Related papers (2023-10-01T08:12:55Z) - MemoryBank: Enhancing Large Language Models with Long-Term Memory [7.654404043517219]
We propose MemoryBank, a novel memory mechanism tailored for Large Language Models.
MemoryBank enables the models to summon relevant memories, continually evolve through continuous memory updates, comprehend, and adapt to a user personality by synthesizing information from past interactions.
arXiv Detail & Related papers (2023-05-17T14:40:29Z) - Recall, Robustness, and Lexicographic Evaluation [49.13362412522523]
The application of recall without a formal evaluative motivation has led to criticism of recall as a vague or inappropriate measure.
Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation.
arXiv Detail & Related papers (2023-02-22T13:39:54Z) - Learning Human Cognitive Appraisal Through Reinforcement Memory Unit [63.83306892013521]
We propose a memory-enhancing mechanism for recurrent neural networks that exploits the effect of human cognitive appraisal in sequential assessment tasks.
We conceptualize the memory-enhancing mechanism as Reinforcement Memory Unit (RMU) that contains an appraisal state together with two positive and negative reinforcement memories.
arXiv Detail & Related papers (2022-08-06T08:56:55Z) - Self-Attentive Associative Memory [69.40038844695917]
We propose to separate the storage of individual experiences (item memory) and their occurring relationships (relational memory)
We achieve competitive results with our proposed two-memory model in a diversity of machine learning tasks.
arXiv Detail & Related papers (2020-02-10T03:27:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.