Evaluating Very Long-Term Conversational Memory of LLM Agents
- URL: http://arxiv.org/abs/2402.17753v1
- Date: Tue, 27 Feb 2024 18:42:31 GMT
- Title: Evaluating Very Long-Term Conversational Memory of LLM Agents
- Authors: Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal,
Francesco Barbieri, Yuwei Fang
- Abstract summary: We introduce a machine-human pipeline to generate high-quality, very long-term dialogues.
We equip each agent with the capability of sharing and reacting to images.
The generated conversations are verified and edited by human annotators for long-range consistency.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing works on long-term open-domain dialogues focus on evaluating model
responses within contexts spanning no more than five chat sessions. Despite
advancements in long-context large language models (LLMs) and retrieval
augmented generation (RAG) techniques, their efficacy in very long-term
dialogues remains unexplored. To address this research gap, we introduce a
machine-human pipeline to generate high-quality, very long-term dialogues by
leveraging LLM-based agent architectures and grounding their dialogues on
personas and temporal event graphs. Moreover, we equip each agent with the
capability of sharing and reacting to images. The generated conversations are
verified and edited by human annotators for long-range consistency and
grounding to the event graphs. Using this pipeline, we collect LoCoMo, a
dataset of very long-term conversations, each encompassing 300 turns and 9K
tokens on average, across up to 35 sessions. Based on LoCoMo, we present a
comprehensive evaluation benchmark to measure long-term memory in models,
encompassing question answering, event summarization, and multi-modal dialogue
generation tasks. Our experimental results indicate that LLMs struggle to
understand lengthy conversations and to comprehend long-range temporal and
causal dynamics within dialogues. Strategies such as long-context LLMs or RAG
offer improvements, but these models still substantially lag behind human
performance.
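The RAG strategy evaluated in the abstract retrieves the dialogue turns most relevant to a question before generating an answer. A minimal sketch of that retrieval step, using a toy bag-of-words retriever and invented dialogue turns in place of a learned retriever and the actual LoCoMo data:

```python
import math
import re
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for one dialogue turn."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(turns, question, k=2):
    """Return the k dialogue turns most similar to the question."""
    q = tf_vector(question)
    return sorted(turns, key=lambda t: cosine(tf_vector(t), q), reverse=True)[:k]

# Toy multi-session history; LoCoMo conversations average ~300 turns.
history = [
    "Session 1: I adopted a beagle puppy named Max last spring.",
    "Session 3: Work has been stressful, so I started evening runs.",
    "Session 7: Max just turned one, we celebrated with a dog cake.",
]
context = retrieve(history, "How old is the speaker's dog Max?")
# The retrieved turns would then be passed to an LLM as grounding context.
```

In a real pipeline the bag-of-words scorer would be replaced by a dense retriever, and the retrieved turns would be fed to the answering model; the benchmark's finding is that even this pattern falls short of human performance on long-range questions.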
Related papers
- REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation [51.97224538045096]
We introduce REALTALK, a 21-day corpus of authentic messaging app dialogues.
We compare emotional intelligence (EI) attributes and persona consistency to understand the challenges posed by real-world dialogues.
Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation.
arXiv Detail & Related papers (2025-02-18T20:29:01Z)
- TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues [13.638344516302851]
Temporal reasoning in multi-session dialogues presents a significant challenge which has been under-studied.
We introduce an approach to construct a new benchmark by augmenting dialogues from LoCoMo and creating multi-choice QAs.
We also present TReMu, a new framework aimed at enhancing the temporal reasoning capabilities of LLM-agents.
arXiv Detail & Related papers (2025-02-03T18:58:19Z)
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input.
This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z)
- Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models [0.0]
We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user interaction.
We context-switch regularly to interleave the tasks, constructing a realistic testing scenario in which we assess the agents' Long-Term Memory, Continual Learning, and Information Integration capabilities.
arXiv Detail & Related papers (2024-09-30T12:01:29Z)
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose Loong, a novel long-context benchmark aligned with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
- Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent).
It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation.
The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z)
- Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models [75.98775135321355]
Given a long conversation, large language models (LLMs) fail to recall past information and tend to generate inconsistent responses.
We propose to have large language models (LLMs) recursively generate summaries as memory to enhance their long-term memory ability.
arXiv Detail & Related papers (2023-08-29T04:59:53Z)
- Long Time No See! Open-Domain Conversation with Long-Term Persona Memory [37.51131984324123]
We present a novel task of Long-term Memory Conversation (LeMon).
We then build a new dialogue dataset, DuLeMon, and a dialogue generation framework, PLATO-LTM, with a Long-Term Memory (LTM) mechanism.
Results on DuLeMon indicate that PLATO-LTM can significantly outperform baselines in terms of long-term dialogue consistency.
arXiv Detail & Related papers (2022-03-11T08:41:14Z)
- An Exploratory Study on Long Dialogue Summarization: What Works and What's Next [33.1899354772074]
We study long dialogue summarization by investigating three strategies to deal with the lengthy input problem and locate relevant information.
Our experimental results on three long dialogue datasets (QMSum, MediaSum, SummScreen) show that the retrieve-then-summarize pipeline models yield the best performance.
arXiv Detail & Related papers (2021-09-10T01:38:26Z)
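Several entries above, notably the recursive-summarization and retrieve-then-summarize papers, share one core loop: fold each new session into a bounded running memory. A minimal sketch of that loop, where summarize() is a hypothetical stand-in that simply keeps the most recent items instead of calling an LLM:

```python
def summarize(memory, new_turns, max_items=3):
    """Fold new turns into the running memory, keeping it bounded.
    Stand-in stub for an LLM summarization call."""
    return (memory + new_turns)[-max_items:]

def run_sessions(sessions):
    """Maintain a recursively updated memory across a multi-session dialogue."""
    memory = []
    for turns in sessions:
        memory = summarize(memory, turns)
    return memory

# Toy three-session dialogue.
sessions = [
    ["A: I moved to Lisbon.", "B: Congrats!"],
    ["A: Started pottery classes."],
    ["A: My pottery teacher is great.", "B: Nice."],
]
memory = run_sessions(sessions)
```

With a real LLM summarizer, the returned memory would be condensed natural-language facts rather than raw turns; the papers above report that this keeps long-term responses more consistent than feeding the full history.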
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.