SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation
- URL: http://arxiv.org/abs/2501.00805v1
- Date: Wed, 01 Jan 2025 11:11:07 GMT
- Title: SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation
- Authors: Haitian Lu, Gaofeng Cheng, Liuping Luo, Leying Zhang, Yanmin Qian, Pengyuan Zhang
- Abstract summary: We propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE).
We convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme.
Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.
- Score: 56.683846056788326
- Abstract: Recently, "textless" speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.
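The abstract describes a three-stage pipeline: an LLM writes the dialogue text, the text is converted to phonemes whose durations a two-tower transformer predicts, and an SLM conditioned on the phoneme sequence vocalizes it. Below is a minimal PyTorch sketch of what such a two-tower duration predictor could look like; the assumption that one tower encodes the target speaker's phonemes while the other encodes the interlocutor's channel, along with all module names and hyperparameters, is ours and not confirmed by the paper.

```python
# Hypothetical two-tower duration predictor (illustrative sketch, not the authors' code).
import torch
import torch.nn as nn

class PhonemeTower(nn.Module):
    """Transformer encoder over a phoneme id sequence."""
    def __init__(self, n_phonemes, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, phonemes):                      # (B, T) int64 phoneme ids
        return self.encoder(self.embed(phonemes))     # (B, T, d_model)

class TwoTowerDurationPredictor(nn.Module):
    """Predicts one duration (in speech-unit frames) per target phoneme,
    attending to the other dialogue channel for conversational context."""
    def __init__(self, n_phonemes, d_model=256):
        super().__init__()
        self.target_tower = PhonemeTower(n_phonemes, d_model)
        self.context_tower = PhonemeTower(n_phonemes, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, target_phonemes, context_phonemes):
        tgt = self.target_tower(target_phonemes)      # (B, T_t, d)
        ctx = self.context_tower(context_phonemes)    # (B, T_c, d)
        fused, _ = self.cross_attn(tgt, ctx, ctx)     # cross-channel attention
        log_dur = self.head(tgt + fused).squeeze(-1)  # (B, T_t)
        return log_dur.exp()                          # strictly positive durations

model = TwoTowerDurationPredictor(n_phonemes=100)
durations = model(torch.randint(0, 100, (2, 32)),    # target phonemes
                  torch.randint(0, 100, (2, 40)))    # interlocutor phonemes
print(durations.shape)                               # torch.Size([2, 32])
```

Predicted durations would then control how many speech units the SLM emits per phoneme; training would typically regress log-durations against forced-alignment targets, though the paper's loss is not specified in the abstract.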
Related papers
- OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency.
Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z)
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named IntrinsicVoice-500k, which includes nearly 500k turns of speech-to-speech dialogues.
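One plausible reading of that length reduction is that consecutive speech-unit embeddings are merged into fixed-size groups before entering the LLM. The sketch below illustrates only that grouping idea; `TokenGrouper`, the group size, and all dimensions are our assumptions, not IntrinsicVoice's actual GroupFormer.

```python
# Illustrative token-grouping sketch: merge every `group_size` speech-unit
# embeddings into one vector so the LLM sees a ~group_size-times-shorter sequence.
import torch
import torch.nn as nn

class TokenGrouper(nn.Module):
    def __init__(self, n_units=1024, d_model=256, group_size=4):
        super().__init__()
        self.group_size = group_size
        self.embed = nn.Embedding(n_units, d_model)
        self.merge = nn.Linear(group_size * d_model, d_model)

    def forward(self, units):                       # (B, T) speech-unit ids
        B, T = units.shape
        pad = (-T) % self.group_size                # pad T to a multiple of group_size
        units = nn.functional.pad(units, (0, pad))
        x = self.embed(units)                       # (B, T+pad, d_model)
        x = x.view(B, -1, self.group_size * x.size(-1))
        return self.merge(x)                        # (B, ceil(T/group_size), d_model)

grouper = TokenGrouper()
short = grouper(torch.randint(0, 1024, (2, 50)))    # 50 unit ids -> 13 positions
print(short.shape)                                  # torch.Size([2, 13, 256])
```

With a group size of 4, a 50-token unit sequence shrinks to 13 positions, bringing speech lengths closer to text lengths.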
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
- Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation [16.724603503894166]
Style-Talker is an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialogue generation.
Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence.
arXiv Detail & Related papers (2024-08-13T04:35:11Z)
- Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations [65.29513437838457]
Even when two turns contain the same sentence, the appropriate responses may still differ if they are spoken in different styles.
We propose the Spoken-LLM framework, which can model both linguistic content and speaking styles.
We train Spoken-LLM using the StyleTalk dataset and devise a two-stage training pipeline to help the Spoken-LLM better learn the speaking styles.
arXiv Detail & Related papers (2024-02-20T07:51:43Z)
- Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation [46.93969003104427]
This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM).
USDM is designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech.
Our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines.
arXiv Detail & Related papers (2024-02-08T14:35:09Z)
- Towards human-like spoken dialogue generation between AI agents from written dialogue [8.4989907582951]
This study proposes CHATS - CHatty Agents Text-to-Speech - a discrete token-based system designed to generate spoken dialogues based on written dialogues.
Our system can generate speech for both the speaker side and the listener side simultaneously, using only the transcription from the speaker side.
arXiv Detail & Related papers (2023-10-02T11:03:20Z)