Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents
- URL: http://arxiv.org/abs/2409.15594v1
- Date: Mon, 23 Sep 2024 23:01:31 GMT
- Title: Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents
- Authors: Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, Hongyu Gong, Shyamnath Gollakota,
- Abstract summary: We propose Synchronous LLMs for full-sync spoken dialogue modeling.
We train a model that generates meaningful and natural spoken dialogue with just 2k hours of real-world spoken dialogue data.
We demonstrate the model's ability to participate in full-sync dialogue by simulating interaction between two agents trained on different datasets.
- Score: 12.555910887280199
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite broad interest in modeling spoken dialogue agents, most approaches are inherently "half-duplex" -- restricted to turn-based interaction with responses requiring explicit prompting by the user or implicit tracking of interruption or silence events. Human dialogue, by contrast, is "full-duplex" allowing for rich synchronicity in the form of quick and dynamic turn-taking, overlapping speech, and backchanneling. Technically, the challenge of achieving full-duplex dialogue with LLMs lies in modeling synchrony as pre-trained LLMs do not have a sense of "time". To bridge this gap, we propose Synchronous LLMs for full-duplex spoken dialogue modeling. We design a novel mechanism to integrate time information into Llama3-8b so that they run synchronously with the real-world clock. We also introduce a training recipe that uses 212k hours of synthetic spoken dialogue data generated from text dialogue data to create a model that generates meaningful and natural spoken dialogue, with just 2k hours of real-world spoken dialogue data. Synchronous LLMs outperform state-of-the-art in dialogue meaningfulness while maintaining naturalness. Finally, we demonstrate the model's ability to participate in full-duplex dialogue by simulating interaction between two agents trained on different datasets, while considering Internet-scale latencies of up to 240 ms. Webpage: https://syncllm.cs.washington.edu/.
Related papers
- REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation [51.97224538045096]
We introduce REALTALK, a 21-day corpus of authentic messaging app dialogues.
We compare EI attributes and persona consistency to understand the challenges posed by real-world dialogues.
Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation.
arXiv Detail & Related papers (2025-02-18T20:29:01Z) - SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation [56.683846056788326]
We propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration.
We convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme.
Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.
arXiv Detail & Related papers (2025-01-01T11:11:07Z) - DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling [73.08187964426823]
Large language models (LLMs) enabled dialogue systems have become one of the central modes in human-machine interaction.
This paper introduces a new research task--$textbfD$ialogue $textbfE$lement $textbfMO$deling.
We propose a novel benchmark, $textbfDEMO$, designed for a comprehensive dialogue modeling and assessment.
arXiv Detail & Related papers (2024-12-06T10:01:38Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
End-to-end GPT-based model OmniFlatten capable of effectively modeling complex behaviors inherent natural conversations with low latency.
Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full- spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - Language Model Can Listen While Speaking [17.584201137311286]
Listen-while-speaking language model (LSLM) is an end-to-end system equipped with both listening and speaking channels.
Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems.
arXiv Detail & Related papers (2024-08-05T16:47:22Z) - Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models [66.24055500785657]
Traditional turn-based chat systems prevent users from verbally interacting with system while it is generating responses.
To overcome these limitations, we adapt existing LLMs to listen users while generating output and provide users with instant feedback.
We build a dataset consisting of alternating time slices of queries and responses as well as covering typical feedback types in instantaneous interactions.
arXiv Detail & Related papers (2024-06-22T03:20:10Z) - A Full-duplex Speech Dialogue Scheme Based On Large Language Models [23.994130020644842]
generative generative dialogue system capable in a full manner allowing seamless interaction.
System generates tokens for inquiry responses and makes autonomous decisions to respond to wait for, or operate the user.
arXiv Detail & Related papers (2024-05-29T20:05:46Z) - Are LLMs Robust for Spoken Dialogues? [10.855403629160921]
Large Pre-Trained Language Models have demonstrated state-of-the-art performance in different downstream tasks.
Most of the publicly available datasets and benchmarks on task-oriented dialogues focus on written conversations.
We have evaluated the performance of LLMs for spoken task-oriented dialogues on the DSTC11 test sets.
arXiv Detail & Related papers (2024-01-04T14:36:38Z) - Back to the Future: Bidirectional Information Decoupling Network for
Multi-turn Dialogue Modeling [80.51094098799736]
We propose Bidirectional Information Decoupling Network (BiDeN) as a universal dialogue encoder.
BiDeN explicitly incorporates both the past and future contexts and can be generalized to a wide range of dialogue-related tasks.
Experimental results on datasets of different downstream tasks demonstrate the universality and effectiveness of our BiDeN.
arXiv Detail & Related papers (2022-04-18T03:51:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.