Real-Time Textless Dialogue Generation
- URL: http://arxiv.org/abs/2501.04877v1
- Date: Wed, 08 Jan 2025 23:21:43 GMT
- Title: Real-Time Textless Dialogue Generation
- Authors: Long Mai, Julie Carson-Berndsen,
- Abstract summary: We propose a real-time, textless spoken dialogue generation model (RTTL-DG)
Our system enables fluid turn-taking and generates responses with minimal delay by processing streaming spoken conversation directly.
Our model incorporates backchannels, filters, laughter, and other paralinguistic signals, which are often absent in cascaded dialogue systems.
- Score: 23.456302461693053
- License:
- Abstract: Recent advancements in large language models (LLMs) have led to significant progress in text-based dialogue systems. These systems can now generate high-quality responses that are accurate and coherent across a wide range of topics and tasks. However, spoken dialogue systems still lag behind in terms of naturalness. They tend to produce robotic interactions, with issues such as slow response times, overly generic or cautious replies, and a lack of natural rhythm and fluid turn-taking. This shortcoming is largely due to the over-reliance on the traditional cascaded design, which involve separate, sequential components, as well as the use of text as an intermediate representation. This paper propose a real-time, textless spoken dialogue generation model (RTTL-DG) that aims to overcome these challenges. Our system enables fluid turn-taking and generates responses with minimal delay by processing streaming spoken conversation directly. Additionally, our model incorporates backchannels, filters, laughter, and other paralinguistic signals, which are often absent in cascaded dialogue systems, to create more natural and human-like interactions. The implementations and generated samples are available in our repository: https://github.com/mailong25/rts2s-dg
Related papers
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
End-to-end GPT-based model OmniFlatten capable of effectively modeling complex behaviors inherent natural conversations with low latency.
Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full- spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoic,e an LLM designed with intrinsic real-time voice interaction capabilities.
Our novelty architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k which includes nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z) - Recent Advances in Speech Language Models: A Survey [45.968078636811356]
Speech Language Models (SpeechLMs) are end-to-end models that generate speech without converting from text.
This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs.
arXiv Detail & Related papers (2024-10-01T21:48:12Z) - Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines independent voice activity detection and text-to-speech.
We show how Moshi Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is first real-time full spoken large language model modality.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation [16.724603503894166]
Style-Talker is an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialog generation.
Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence.
arXiv Detail & Related papers (2024-08-13T04:35:11Z) - PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems [7.326036800127981]
Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems.
generating a spoken response requires the prior generation of a written response, and speech sequences are significantly longer than text sequences.
This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech.
arXiv Detail & Related papers (2024-06-18T09:23:54Z) - Modeling Real-Time Interactive Conversations as Timed Diarized Transcripts [11.067252960486272]
We present a simple yet general method to simulate real-time interactive conversations using pretrained language models.
We demonstrate the promise of this method with two case studies: instant messenger dialogues and spoken conversations.
arXiv Detail & Related papers (2024-05-21T21:14:31Z) - Multimodal Contextual Dialogue Breakdown Detection for Conversational AI Models [1.4199474167684119]
We introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model.
This model significantly outperforms other known best models by achieving an F1 of 69.27.
arXiv Detail & Related papers (2024-04-11T23:09:18Z) - CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations [97.75037148056367]
CoVoMix is a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation.
We devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation.
arXiv Detail & Related papers (2024-04-10T02:32:58Z) - Back to the Future: Bidirectional Information Decoupling Network for
Multi-turn Dialogue Modeling [80.51094098799736]
We propose Bidirectional Information Decoupling Network (BiDeN) as a universal dialogue encoder.
BiDeN explicitly incorporates both the past and future contexts and can be generalized to a wide range of dialogue-related tasks.
Experimental results on datasets of different downstream tasks demonstrate the universality and effectiveness of our BiDeN.
arXiv Detail & Related papers (2022-04-18T03:51:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.