Towards Joint Modeling of Dialogue Response and Speech Synthesis based
on Large Language Model
- URL: http://arxiv.org/abs/2309.11000v1
- Date: Wed, 20 Sep 2023 01:48:27 GMT
- Title: Towards Joint Modeling of Dialogue Response and Speech Synthesis based
on Large Language Model
- Authors: Xinyu Zhou, Delong Chen, Yudong Chen
- Abstract summary: This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously.
- Score: 8.180382743037082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores the potential of constructing an AI spoken dialogue
system that "thinks how to respond" and "thinks how to speak" simultaneously,
which more closely aligns with the human speech production process compared to
the current cascade pipeline of independent chatbot and Text-to-Speech (TTS)
modules. We hypothesize that Large Language Models (LLMs) with billions of
parameters possess significant speech understanding capabilities and can
jointly model dialogue responses and linguistic features. We conduct two sets
of experiments: 1) Prosodic structure prediction, a typical front-end task in
TTS, demonstrating the speech understanding ability of LLMs, and 2) Further
integrating dialogue response and a wide array of linguistic features using a
unified encoding format. Our results indicate that the LLM-based approach is a
promising direction for building unified spoken dialogue systems.
Related papers
- OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [24.68804661538364]
Full spoken dialogue systems significantly mirror human-human interactions.
achieving low latency and natural interactions is a significant challenge.
End-to-end full-to-end spoken dialogue systems are a promising direction for developing efficient and natural end-to-end systems.
Audio samples of dialogues generated by OmniFlatten can be found at this web site.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation [46.93969003104427]
This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM)
USDM is designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech.
Our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines.
arXiv Detail & Related papers (2024-02-08T14:35:09Z) - Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z) - STRUDEL: Structured Dialogue Summarization for Dialogue Comprehension [42.57581945778631]
Abstractive dialogue summarization has long been viewed as an important standalone task in natural language processing.
We propose a novel type of dialogue summarization task - STRUctured DiaLoguE Summarization.
We show that our STRUDEL dialogue comprehension model can significantly improve the dialogue comprehension performance of transformer encoder language models.
arXiv Detail & Related papers (2022-12-24T04:39:54Z) - Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis
Using Linguistic and Prosodic Contexts of Dialogue History [38.65020349874135]
We propose an end-to-end empathetic dialogue speech synthesis (DSS) model.
Our model is conditioned by the history of linguistic and prosody features for predicting appropriate dialogue context.
To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained with large speech corpora, 2) a style-guided training using a prosody embedding of the current utterance to be predicted by the dialogue context embedding, 3) a cross-modal attention to combine text and speech modalities, and 4) a sentence-wise embedding to achieve fine-grained prosody modeling rather than utterance-wise modeling.
arXiv Detail & Related papers (2022-06-16T09:47:25Z) - Back to the Future: Bidirectional Information Decoupling Network for
Multi-turn Dialogue Modeling [80.51094098799736]
We propose Bidirectional Information Decoupling Network (BiDeN) as a universal dialogue encoder.
BiDeN explicitly incorporates both the past and future contexts and can be generalized to a wide range of dialogue-related tasks.
Experimental results on datasets of different downstream tasks demonstrate the universality and effectiveness of our BiDeN.
arXiv Detail & Related papers (2022-04-18T03:51:46Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - Filling the Gap of Utterance-aware and Speaker-aware Representation for
Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In the existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) as encoder represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z) - Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework by formulating sequence-to-grounded dialogue tasks as a sequence-to-grounded task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.