End-to-End Text-to-Speech Based on Latent Representation of Speaking
Styles Using Spontaneous Dialogue
- URL: http://arxiv.org/abs/2206.12040v1
- Date: Fri, 24 Jun 2022 02:32:12 GMT
- Title: End-to-End Text-to-Speech Based on Latent Representation of Speaking
Styles Using Spontaneous Dialogue
- Authors: Kentaro Mitsui, Tianyu Zhao, Kei Sawada, Yukiya Hono, Yoshihiko
Nankaku, Keiichi Tokuda
- Abstract summary: This study aims to realize a text-to-speech (TTS) that closely resembles human dialogue.
First, we record and transcribe actual spontaneous dialogues.
The proposed dialogue TTS is trained in two stages: in the first stage, variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained.
- Score: 19.149834552175076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-speech (TTS) systems have achieved quality comparable
to that of humans; however, their application to spoken dialogue has not been
widely studied. This study aims to realize a TTS that closely resembles human
dialogue. First, we record and transcribe actual spontaneous dialogues. Then,
the proposed dialogue TTS is trained in two stages. In the first stage,
variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder
(GMVAE)-VITS is trained; both models introduce an utterance-level latent
variable into variational inference with adversarial learning for end-to-end
text-to-speech (VITS), a recently proposed end-to-end TTS model. A style
encoder that extracts a latent speaking-style representation from speech is
trained jointly with the TTS model. In the second stage, a style predictor is
trained to predict, from the dialogue history, the speaking style to be
synthesized. During inference, the speaking-style representation predicted by
the style predictor is passed to VAE/GMVAE-VITS, so that speech can be
synthesized in a style appropriate to the context of the dialogue. Subjective
evaluation results demonstrate that the proposed method outperforms the
original VITS in terms of dialogue-level naturalness.
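To make the two-stage pipeline above concrete, the following is a minimal PyTorch-style sketch of how a stage-1 style encoder and a stage-2 style predictor could be wired together. It is an illustrative assumption, not the authors' implementation: the module names, dimensions, and the `encode_dialogue_history` and `vae_vits.infer` helpers are hypothetical.

```python
# Minimal conceptual sketch of the two-stage dialogue-TTS pipeline described in
# the abstract. All names and dimensions are illustrative assumptions, not the
# authors' released code.
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    """Stage 1: maps a reference mel-spectrogram to an utterance-level latent
    speaking-style representation (trained jointly with (GM)VAE-VITS)."""

    def __init__(self, n_mels: int = 80, style_dim: int = 16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 128, batch_first=True)
        self.to_mu = nn.Linear(128, style_dim)
        self.to_logvar = nn.Linear(128, style_dim)

    def forward(self, mel: torch.Tensor):  # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)               # h: (1, batch, 128)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample the utterance-level latent variable.
        style = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return style, mu, logvar


class StylePredictor(nn.Module):
    """Stage 2: predicts the speaking-style latent of the next utterance from
    an embedding of the dialogue history (this sketch assumes the stage-1
    model is kept fixed in stage 2, which the abstract does not state)."""

    def __init__(self, history_dim: int = 768, style_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(history_dim, 256),
            nn.ReLU(),
            nn.Linear(256, style_dim),
        )

    def forward(self, history_emb: torch.Tensor):  # (batch, history_dim)
        return self.net(history_emb)


# Inference (pseudo-usage): the predicted style conditions (GM)VAE-VITS so the
# synthesized speech matches the dialogue context. Both helpers below are
# hypothetical placeholders.
# style = style_predictor(encode_dialogue_history(previous_turns))
# wav = vae_vits.infer(phonemes, speaker_id, style)
```

A GMVAE variant would replace the single Gaussian prior over the utterance-level latent used in this sketch with a mixture of Gaussians.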
Related papers
- Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling [40.32021786228235]
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting.
We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS.
To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk)
arXiv Detail & Related papers (2024-10-12T13:02:31Z)
- Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation [16.724603503894166]
Style-Talker is an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialog generation.
Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence.
arXiv Detail & Related papers (2024-08-13T04:35:11Z)
- Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment [54.8991472306962]
We propose Speech-text dialog Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA)
SPECTRA is the first-ever speech-text dialog pre-training model.
Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.
arXiv Detail & Related papers (2023-05-19T10:37:56Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis [75.74906149219817]
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context.
We propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation.
arXiv Detail & Related papers (2022-10-27T12:20:20Z)
- Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History [38.65020349874135]
We propose an end-to-end empathetic dialogue speech synthesis (DSS) model.
Our model is conditioned on the history of linguistic and prosody features for predicting an appropriate dialogue context.
To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained with large speech corpora, 2) a style-guided training using a prosody embedding of the current utterance to be predicted by the dialogue context embedding, 3) a cross-modal attention to combine text and speech modalities, and 4) a sentence-wise embedding to achieve fine-grained prosody modeling rather than utterance-wise modeling.
arXiv Detail & Related papers (2022-06-16T09:47:25Z)
- StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis [23.17929822987861]
StyleTTS is a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance.
Our method significantly outperforms state-of-the-art models on both single and multi-speaker datasets.
arXiv Detail & Related papers (2022-05-30T21:34:40Z)
- On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP)
arXiv Detail & Related papers (2021-07-20T13:30:23Z)
- Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
Research on learning spoken styles from historical conversations is still in its infancy.
Existing approaches consider only the transcripts of the historical conversations, neglecting the spoken styles in the historical speech.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)
- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from single speech audio.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)