Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
- URL: http://arxiv.org/abs/2408.11849v1
- Date: Tue, 13 Aug 2024 04:35:11 GMT
- Title: Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
- Authors: Yinghao Aaron Li, Xilin Jiang, Jordan Darefsky, Ge Zhu, Nima Mesgarani,
- Abstract summary: Style-Talker is an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialog generation.
Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence.
- Score: 16.724603503894166
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and contextually relevant dialogues. However, extending these advancements to enable end-to-end speech-to-speech conversation bots remains a formidable challenge, primarily due to the extensive dataset and computational resources required. The conventional approach of cascading automatic speech recognition (ASR), LLM, and text-to-speech (TTS) models in a pipeline, while effective, suffers from unnatural prosody because it lacks direct interactions between the input audio and its transcribed text and the output audio. These systems are also limited by their inherent latency from the ASR process for real-time applications. This paper introduces Style-Talker, an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialog generation. Style-Talker takes user input audio and uses transcribed chat history and speech styles to generate both the speaking style and text for the response. Subsequently, the TTS model synthesizes the speech, which is then played back to the user. While the response speech is being played, the input speech undergoes ASR processing to extract the transcription and speaking style, serving as the context for the ensuing dialogue turn. This novel pipeline accelerates the traditional cascade ASR-LLM-TTS systems while integrating rich paralinguistic information from input speech. Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence while being more than 50% faster.
Related papers
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z) - IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoic,e an LLM designed with intrinsic real-time voice interaction capabilities.
Our novelty architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k which includes nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z) - Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM [3.6950912517562435]
We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities.
Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions.
arXiv Detail & Related papers (2024-09-25T20:59:12Z) - Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines independent voice activity detection and text-to-speech.
We show how Moshi Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is first real-time full spoken large language model modality.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - Generative Expressive Conversational Speech Synthesis [47.53014375797254]
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting.
We propose a novel generative expressive CSS system, termed GPT-Talker.
We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context.
arXiv Detail & Related papers (2024-07-31T10:02:21Z) - Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation [46.93969003104427]
This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM)
USDM is designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech.
Our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines.
arXiv Detail & Related papers (2024-02-08T14:35:09Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - End-to-End Text-to-Speech Based on Latent Representation of Speaking
Styles Using Spontaneous Dialogue [19.149834552175076]
This study aims to realize a text-to-speech (TTS) that closely resembles human dialogue.
First, we record and transcribe actual spontaneous dialogues.
Proposed dialogue TTS is trained in two stages: first stage, variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS.
arXiv Detail & Related papers (2022-06-24T02:32:12Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - Spoken Style Learning with Multi-modal Hierarchical Context Encoding for
Conversational Text-to-Speech Synthesis [59.27994987902646]
The study about learning spoken styles from historical conversations is still in its infancy.
Only the transcripts of the historical conversations are considered, which neglects the spoken styles in historical speeches.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.