Cross-Utterance Conditioned VAE for Speech Generation
- URL: http://arxiv.org/abs/2309.04156v1
- Date: Fri, 8 Sep 2023 06:48:41 GMT
- Title: Cross-Utterance Conditioned VAE for Speech Generation
- Authors: Yang Li, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian, Ying Wen, Wei
Pan, Chao Zhang, Jun Wang, Yang Yang, Fanglei Sun
- Abstract summary: We present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation.
We propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing.
- Score: 27.5887600344053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech synthesis systems powered by neural networks hold promise for
multimedia production, but frequently face issues with producing expressive
speech and seamless editing. In response, we present the Cross-Utterance
Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to
enhance prosody and ensure natural speech generation. This framework leverages
the powerful representational capabilities of pre-trained language models and
the re-expression abilities of variational autoencoders (VAEs). The core
component of the CUC-VAE S2 framework is the cross-utterance CVAE, which
extracts acoustic, speaker, and textual features from surrounding sentences to
generate context-sensitive prosodic features, more accurately emulating human
prosody generation. We further propose two practical algorithms tailored for
distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and
CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the
framework, designed to generate audio with contextual prosody derived from
surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real
mel spectrogram sampling conditioned on contextual information, producing audio
that closely mirrors real sound, thereby facilitating flexible text-based speech
editing operations such as deletion, insertion, and replacement.
Experimental results on the LibriTTS dataset demonstrate that our proposed
models significantly enhance speech synthesis and editing, producing more
natural and expressive speech.
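The abstract's core idea (a cross-utterance conditional VAE whose prior over prosody latents is conditioned on features from surrounding utterances, while the posterior additionally sees the target utterance's acoustics) can be illustrated with a minimal sketch. The PyTorch code below is an assumption-laden illustration, not the authors' implementation: the module layout, dimensions, and plain MLP encoders are hypothetical, and the paper's actual model builds on pre-trained language-model features and a full TTS backbone.

```python
# Minimal sketch of a cross-utterance conditioned VAE for prosody (illustrative only).
# Assumes `ctx` is a fixed-size embedding of the surrounding utterances (e.g. pooled
# features from a pre-trained language model) and `acoustic` is a per-utterance
# acoustic/prosody feature vector; both shapes are hypothetical.
import torch
import torch.nn as nn


class CrossUtteranceCVAE(nn.Module):
    def __init__(self, ctx_dim=768, acoustic_dim=80, latent_dim=16, hidden=256):
        super().__init__()
        # Posterior q(z | acoustic, ctx): sees the target utterance's acoustics
        # plus the cross-utterance context (used during training).
        self.posterior = nn.Sequential(
            nn.Linear(acoustic_dim + ctx_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * latent_dim),
        )
        # Prior p(z | ctx): conditioned only on surrounding-sentence features,
        # so context-sensitive prosody can be sampled at TTS inference time.
        self.prior = nn.Sequential(
            nn.Linear(ctx_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * latent_dim),
        )
        # Decoder reconstructs acoustic features from the latent and the context.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + ctx_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, acoustic, ctx):
        mu_q, logvar_q = self.posterior(torch.cat([acoustic, ctx], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(ctx).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterisation
        recon = self.decoder(torch.cat([z, ctx], -1))
        # KL(q || p) between two diagonal Gaussians, summed over latent dimensions.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1.0).sum(-1)
        return recon, kl

    @torch.no_grad()
    def sample(self, ctx):
        # TTS-style inference: draw prosody latents from the context-conditioned prior.
        mu_p, logvar_p = self.prior(ctx).chunk(2, -1)
        z = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()
        return self.decoder(torch.cat([z, ctx], -1))


# Usage sketch: reconstruction plus KL gives the usual (beta-)VAE training objective.
model = CrossUtteranceCVAE()
acoustic = torch.randn(4, 80)   # dummy acoustic features for a batch of 4 utterances
ctx = torch.randn(4, 768)       # dummy cross-utterance context embeddings
recon, kl = model(acoustic, ctx)
loss = ((recon - acoustic) ** 2).mean() + 1e-2 * kl.mean()
```

At inference, the CUC-VAE TTS path would sample prosody latents from the context-conditioned prior, as in sample(); the CUC-VAE SE path instead conditions on real mel-spectrogram frames around the edited region, which this sketch only hints at through the posterior.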
Related papers
- StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis [63.019962126807116]
The expressive quality of synthesized speech for audiobooks is limited by a generalized model architecture and an unbalanced style distribution.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
arXiv Detail & Related papers (2023-12-19T14:13:26Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation [22.38340990398735]
We propose a novel data augmentation method by applying the text-based speech editing model.
The experimental results on code-switching and NER tasks show that our proposed method can significantly outperform the audio splicing and neural TTS based data augmentation systems.
arXiv Detail & Related papers (2023-06-14T15:50:13Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing [31.666920933058144]
We propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$^3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training.
Experiments show A$3$T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.
arXiv Detail & Related papers (2022-03-18T01:36:25Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.