Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis
Using Linguistic and Prosodic Contexts of Dialogue History
- URL: http://arxiv.org/abs/2206.08039v1
- Date: Thu, 16 Jun 2022 09:47:25 GMT
- Title: Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis
Using Linguistic and Prosodic Contexts of Dialogue History
- Authors: Yuto Nishimura, Yuki Saito, Shinnosuke Takamichi, Kentaro Tachibana,
and Hiroshi Saruwatari
- Abstract summary: We propose an end-to-end empathetic dialogue speech synthesis (DSS) model.
Our model is conditioned by the history of linguistic and prosody features for predicting appropriate dialogue context.
To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained with large speech corpora, 2) a style-guided training using a prosody embedding of the current utterance to be predicted by the dialogue context embedding, 3) a cross-modal attention to combine text and speech modalities, and 4) a sentence-wise embedding to achieve fine-grained prosody modeling rather than utterance-wise modeling.
- Score: 38.65020349874135
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose an end-to-end empathetic dialogue speech synthesis (DSS) model
that considers both the linguistic and prosodic contexts of dialogue history.
Empathy is the active attempt by humans to get inside the interlocutor in
dialogue, and empathetic DSS is a technology to implement this act in spoken
dialogue systems. Our model is conditioned by the history of linguistic and
prosody features for predicting appropriate dialogue context. As such, it can
be regarded as an extension of the conventional linguistic-feature-based
dialogue history modeling. To train the empathetic DSS model effectively, we
investigate 1) a self-supervised learning model pretrained with large speech
corpora, 2) a style-guided training using a prosody embedding of the current
utterance to be predicted by the dialogue context embedding, 3) a cross-modal
attention to combine text and speech modalities, and 4) a sentence-wise
embedding to achieve fine-grained prosody modeling rather than utterance-wise
modeling. The evaluation results demonstrate that 1) simply considering
prosodic contexts of the dialogue history does not improve the quality of
speech in empathetic DSS and 2) introducing style-guided training and
sentence-wise embedding modeling achieves higher speech quality than that by
the conventional method.
Related papers
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z)
- Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model [8.180382743037082]
This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously.
arXiv Detail & Related papers (2023-09-20T01:48:27Z)
- FutureTOD: Teaching Future Knowledge to Pre-trained Language Model for Task-Oriented Dialogue [20.79359173822053]
We propose a novel dialogue pre-training model, FutureTOD, which distills future knowledge to the representation of the previous dialogue context.
Our intuition is that a good dialogue representation both learns local context information and predicts future information.
arXiv Detail & Related papers (2023-06-17T10:40:07Z)
- STRUDEL: Structured Dialogue Summarization for Dialogue Comprehension [42.57581945778631]
Abstractive dialogue summarization has long been viewed as an important standalone task in natural language processing.
We propose a novel type of dialogue summarization task - STRUctured DiaLoguE Summarization.
We show that our STRUDEL dialogue comprehension model can significantly improve the dialogue comprehension performance of transformer encoder language models.
arXiv Detail & Related papers (2022-12-24T04:39:54Z)
- FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis [75.74906149219817]
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context.
We propose a novel expressive conversational TTS model, termed FCTalker, that learns the fine- and coarse-grained context dependency at the same time during speech generation.
arXiv Detail & Related papers (2022-10-27T12:20:20Z)
- End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue [19.149834552175076]
This study aims to realize a text-to-speech (TTS) system that closely resembles human dialogue.
First, we record and transcribe actual spontaneous dialogues.
The proposed dialogue TTS is trained in two stages: in the first stage, a variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained.
arXiv Detail & Related papers (2022-06-24T02:32:12Z)
- Advances in Multi-turn Dialogue Comprehension: A Survey [51.215629336320305]
Training machines to understand natural language and interact with humans is an elusive and essential task of artificial intelligence.
This paper reviews the previous methods from the technical perspective of dialogue modeling for the dialogue comprehension task.
In addition, we categorize dialogue-related pre-training techniques which are employed to enhance PrLMs in dialogue scenarios.
arXiv Detail & Related papers (2021-10-11T03:52:37Z)
- Advances in Multi-turn Dialogue Comprehension: A Survey [51.215629336320305]
We review the previous methods from the perspective of dialogue modeling.
We discuss three typical patterns of dialogue modeling that are widely used in dialogue comprehension tasks.
arXiv Detail & Related papers (2021-03-04T15:50:17Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)