Evaluating Very Long-Term Conversational Memory of LLM Agents
- URL: http://arxiv.org/abs/2402.17753v1
- Date: Tue, 27 Feb 2024 18:42:31 GMT
- Title: Evaluating Very Long-Term Conversational Memory of LLM Agents
- Authors: Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal,
Francesco Barbieri, Yuwei Fang
- Abstract summary: We introduce a machine-human pipeline to generate high-quality, very long-term dialogues.
We equip each agent with the capability of sharing and reacting to images.
The generated conversations are verified and edited by human annotators for long-range consistency.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing works on long-term open-domain dialogues focus on evaluating model
responses within contexts spanning no more than five chat sessions. Despite
advancements in long-context large language models (LLMs) and retrieval
augmented generation (RAG) techniques, their efficacy in very long-term
dialogues remains unexplored. To address this research gap, we introduce a
machine-human pipeline to generate high-quality, very long-term dialogues by
leveraging LLM-based agent architectures and grounding their dialogues on
personas and temporal event graphs. Moreover, we equip each agent with the
capability of sharing and reacting to images. The generated conversations are
verified and edited by human annotators for long-range consistency and
grounding to the event graphs. Using this pipeline, we collect LoCoMo, a
dataset of very long-term conversations, each encompassing 300 turns and 9K
tokens on average, across up to 35 sessions. Based on LoCoMo, we present a
comprehensive evaluation benchmark to measure long-term memory in models,
encompassing question answering, event summarization, and multi-modal dialogue
generation tasks. Our experimental results indicate that LLMs struggle to
understand lengthy conversations and to comprehend long-range temporal and
causal dynamics within dialogues. Strategies such as long-context LLMs or RAG
offer improvements, but these models still substantially lag behind human
performance.
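The RAG strategy evaluated in the abstract retrieves the dialogue turns most relevant to a question before generating an answer. A minimal sketch of that retrieval step, using a toy bag-of-words retriever and invented dialogue turns in place of a learned retriever and the actual LoCoMo data:

```python
import math
import re
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for one dialogue turn."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(turns, question, k=2):
    """Return the k dialogue turns most similar to the question."""
    q = tf_vector(question)
    return sorted(turns, key=lambda t: cosine(tf_vector(t), q), reverse=True)[:k]

# Toy multi-session history; LoCoMo conversations average ~300 turns.
history = [
    "Session 1: I adopted a beagle puppy named Max last spring.",
    "Session 3: Work has been stressful, so I started evening runs.",
    "Session 7: Max just turned one, we celebrated with a dog cake.",
]
context = retrieve(history, "How old is the speaker's dog Max?")
# The retrieved turns would then be passed to an LLM as grounding context.
```

In a real pipeline the bag-of-words scorer would be replaced by a dense retriever, and the retrieved turns would be fed to the answering model; the benchmark's finding is that even this pattern falls short of human performance on long-range questions.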
Related papers
- REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation [51.97224538045096]
We introduce REALTALK, a 21-day corpus of authentic messaging app dialogues.
We compare emotional intelligence (EI) attributes and persona consistency to understand the challenges posed by real-world dialogues.
Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation.
arXiv Detail & Related papers (2025-02-18T20:29:01Z)
- TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues [13.638344516302851]
Temporal reasoning in multi-session dialogues presents a significant challenge which has been under-studied.
We introduce an approach to construct a new benchmark by augmenting dialogues from LoCoMo and creating multi-choice QAs.
We also present TReMu, a new framework aimed at enhancing the temporal reasoning capabilities of LLM-agents.
arXiv Detail & Related papers (2025-02-03T18:58:19Z)
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input.
This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z)
- Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models [0.0]
We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user interaction.
We context-switch regularly to interleave the tasks, constructing a realistic testing scenario in which we assess the agents' Long-Term Memory, Continual Learning, and Information Integration capabilities.
arXiv Detail & Related papers (2024-09-30T12:01:29Z)
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose Loong, a novel long-context benchmark aligned with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
- Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent).
It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation.
The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z)
- Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models [75.98775135321355]
Given a long conversation, large language models (LLMs) fail to recall past information and tend to generate inconsistent responses.
We propose to have large language models (LLMs) recursively generate summaries as memory to enhance their long-term memory ability.
arXiv Detail & Related papers (2023-08-29T04:59:53Z)
- Long Time No See! Open-Domain Conversation with Long-Term Persona Memory [37.51131984324123]
We present a novel task of Long-term Memory Conversation (LeMon).
We then build a new dialogue dataset, DuLeMon, and a dialogue generation framework, PLATO-LTM, with a Long-Term Memory (LTM) mechanism.
Results on DuLeMon indicate that PLATO-LTM can significantly outperform baselines in terms of long-term dialogue consistency.
arXiv Detail & Related papers (2022-03-11T08:41:14Z)
- An Exploratory Study on Long Dialogue Summarization: What Works and What's Next [33.1899354772074]
We study long dialogue summarization by investigating three strategies to deal with the lengthy input problem and locate relevant information.
Our experimental results on three long dialogue datasets (QMSum, MediaSum, SummScreen) show that the retrieve-then-summarize pipeline models yield the best performance.
arXiv Detail & Related papers (2021-09-10T01:38:26Z)
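Several entries above, notably the recursive-summarization and retrieve-then-summarize papers, share one core loop: fold each new session into a bounded running memory. A minimal sketch of that loop, where summarize() is a hypothetical stand-in that simply keeps the most recent items instead of calling an LLM:

```python
def summarize(memory, new_turns, max_items=3):
    """Fold new turns into the running memory, keeping it bounded.
    Stand-in stub for an LLM summarization call."""
    return (memory + new_turns)[-max_items:]

def run_sessions(sessions):
    """Maintain a recursively updated memory across a multi-session dialogue."""
    memory = []
    for turns in sessions:
        memory = summarize(memory, turns)
    return memory

# Toy three-session dialogue.
sessions = [
    ["A: I moved to Lisbon.", "B: Congrats!"],
    ["A: Started pottery classes."],
    ["A: My pottery teacher is great.", "B: Nice."],
]
memory = run_sessions(sessions)
```

With a real LLM summarizer, the returned memory would be condensed natural-language facts rather than raw turns; the papers above report that this keeps long-term responses more consistent than feeding the full history.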
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.