Evaluating Very Long-Term Conversational Memory of LLM Agents
- URL: http://arxiv.org/abs/2402.17753v1
- Date: Tue, 27 Feb 2024 18:42:31 GMT
- Title: Evaluating Very Long-Term Conversational Memory of LLM Agents
- Authors: Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang
- Abstract summary: We introduce a machine-human pipeline to generate high-quality, very long-term dialogues.
We equip each agent with the capability of sharing and reacting to images.
The generated conversations are verified and edited by human annotators for long-range consistency.
- Score: 95.84027826745609
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing works on long-term open-domain dialogues focus on evaluating model
responses within contexts spanning no more than five chat sessions. Despite
advancements in long-context large language models (LLMs) and retrieval
augmented generation (RAG) techniques, their efficacy in very long-term
dialogues remains unexplored. To address this research gap, we introduce a
machine-human pipeline to generate high-quality, very long-term dialogues by
leveraging LLM-based agent architectures and grounding their dialogues on
personas and temporal event graphs. Moreover, we equip each agent with the
capability of sharing and reacting to images. The generated conversations are
verified and edited by human annotators for long-range consistency and
grounding to the event graphs. Using this pipeline, we collect LoCoMo, a
dataset of very long-term conversations, each encompassing 300 turns and 9K
tokens on avg., over up to 35 sessions. Based on LoCoMo, we present a
comprehensive evaluation benchmark to measure long-term memory in models,
encompassing question answering, event summarization, and multi-modal dialogue
generation tasks. Our experimental results indicate that LLMs exhibit
challenges in understanding lengthy conversations and comprehending long-range
temporal and causal dynamics within dialogues. Employing strategies like
long-context LLMs or RAG can offer improvements but these models still
substantially lag behind human performance.
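As a rough illustration of the RAG-style strategy the abstract mentions, the sketch below retrieves the turns of a multi-session dialogue that share the most words with a question and assembles them into a short context. It is a toy under stated assumptions, not the paper's pipeline: the `Turn` record, the word-overlap scoring, and the names `retrieve` and `build_context` are all hypothetical, and a real system would likely use a learned retriever.

```python
from dataclasses import dataclass
import re


@dataclass
class Turn:
    session: int  # 1-based index of the chat session (hypothetical schema)
    speaker: str
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real retriever's features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(turns: list[Turn], question: str, k: int = 2) -> list[Turn]:
    """Return the k turns with the largest word overlap with the question.

    Python's sort is stable, so ties keep their original dialogue order.
    """
    q = tokens(question)
    ranked = sorted(turns, key=lambda t: len(q & tokens(t.text)), reverse=True)
    return ranked[:k]


def build_context(turns: list[Turn], question: str, k: int = 2) -> str:
    """Format the retrieved turns in session order, ready to prepend to a prompt."""
    hits = sorted(retrieve(turns, question, k), key=lambda t: t.session)
    return "\n".join(f"[session {t.session}] {t.speaker}: {t.text}" for t in hits)


# Toy dialogue spanning three sessions (invented data for illustration only).
dialogue = [
    Turn(1, "A", "I adopted a puppy named Milo last week."),
    Turn(2, "B", "Work has been busy, lots of deadlines."),
    Turn(3, "A", "Milo finally learned to sit on command."),
]

if __name__ == "__main__":
    print(build_context(dialogue, "What did Milo learn?"))
```

Keeping the session index on every retrieved turn is a deliberate choice here: questions about temporal or causal order need to know when something was said, not just what was said.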
Related papers
- LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory [68.97819665784442]
This paper introduces LongMemEval, a benchmark designed to evaluate five core long-term memory abilities of chat assistants.
LongMemEval presents a significant challenge to existing long-term memory systems.
We present a unified framework that breaks down the long-term memory design into four design choices.
arXiv Detail & Related papers (2024-10-14T17:59:44Z)
- Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models [0.0]
We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user interaction.
We context switch regularly to interleave the tasks, which constructs a realistic testing scenario in which we assess the Long-Term Memory, Continual Learning, and Information Integration capabilities of the agents.
arXiv Detail & Related papers (2024-09-30T12:01:29Z)
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
- Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent).
It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation.
The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z)
- Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding [57.62275091656578]
We refer to a complex event composed of many news articles over an extended period as a Temporal Complex Event (TCE).
This paper proposes a novel approach using Large Language Models (LLMs) to systematically extract and analyze the event chain within TCE.
arXiv Detail & Related papers (2024-06-04T16:42:17Z)
- Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models [75.98775135321355]
Given a long conversation, large language models (LLMs) fail to recall past information and tend to generate inconsistent responses.
We propose to generate summaries/memory using large language models (LLMs) to enhance long-term memory ability.
arXiv Detail & Related papers (2023-08-29T04:59:53Z)
- Long Time No See! Open-Domain Conversation with Long-Term Persona Memory [37.51131984324123]
We present a novel task of Long-term Memory Conversation (LeMon).
We then build a new dialogue dataset DuLeMon and a dialogue generation framework with Long-Term Memory (LTM) mechanism.
Results on DuLeMon indicate that PLATO-LTM can significantly outperform baselines in terms of long-term dialogue consistency.
arXiv Detail & Related papers (2022-03-11T08:41:14Z)
- An Exploratory Study on Long Dialogue Summarization: What Works and What's Next [33.1899354772074]
We study long dialogue summarization by investigating three strategies to deal with the lengthy input problem and locate relevant information.
Our experimental results on three long dialogue datasets (QMSum, MediaSum, SummScreen) show that the retrieve-then-summarize pipeline models yield the best performance.
arXiv Detail & Related papers (2021-09-10T01:38:26Z)