StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based
Pre-training for Expressive Audiobook Speech Synthesis
- URL: http://arxiv.org/abs/2312.12181v1
- Date: Tue, 19 Dec 2023 14:13:26 GMT
- Title: StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based
Pre-training for Expressive Audiobook Speech Synthesis
- Authors: Xueyuan Chen, Xi Wang, Shaofei Zhang, Lei He, Zhiyong Wu, Xixin Wu,
Helen Meng
- Abstract summary: The expressive quality of synthesized speech for audiobooks is limited by generalized model architecture and unbalanced style distribution.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
- Score: 63.019962126807116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The expressive quality of synthesized speech for audiobooks is limited by
generalized model architecture and unbalanced style distribution in the
training data. To address these issues, in this paper, we propose a
self-supervised style enhancing method with VQ-VAE-based pre-training for
expressive audiobook speech synthesis. Firstly, a text style encoder is
pre-trained with a large amount of unlabeled text-only data. Secondly, a
spectrogram style extractor based on VQ-VAE is pre-trained in a self-supervised
manner, with plenty of audio data that covers complex style variations. Then a
novel architecture with two encoder-decoder paths is specially designed to
model the pronunciation and high-level style expressiveness respectively, with
the guidance of the style extractor. Both objective and subjective evaluations
demonstrate that our proposed method can effectively improve the naturalness
and expressiveness of the synthesized speech in audiobook synthesis especially
for the role and out-of-domain scenarios.
Related papers
- Style Description based Text-to-Speech with Conditional Prosodic Layer
Normalization based Diffusion GAN [17.876323494898536]
We present a Diffusion GAN based approach (Prosodic Diff-TTS) to generate the corresponding high-fidelity speech based on the style description and content text as an input to generate speech samples within only 4 denoising steps.
We demonstrate the efficacy of our proposed architecture on multi-speaker LibriTTS and PromptSpeech datasets, using multiple quantitative metrics that measure generated accuracy and MOS.
arXiv Detail & Related papers (2023-10-27T14:28:41Z) - Cross-Utterance Conditioned VAE for Speech Generation [27.5887600344053]
We present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation.
We propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing.
arXiv Detail & Related papers (2023-09-08T06:48:41Z) - Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any
Voice Conversion using Only Speech Data [2.6217304977339473]
We propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content.
Our proposed approach introduces an attention mechanism utilizing a self-supervised learning (SSL) model.
Experiment results show that our proposed method combined with a diffusion-based generative model can achieve better speaker similarity in any-to-any voice conversion tasks.
arXiv Detail & Related papers (2023-09-06T05:33:54Z) - Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z) - CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained
Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - Self-supervised Context-aware Style Representation for Expressive Speech
Synthesis [23.460258571431414]
We propose a novel framework for learning style representation from plain text in a self-supervised manner.
It leverages an emotion lexicon and uses contrastive learning and deep clustering.
Our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain test sets in audiobook speech.
arXiv Detail & Related papers (2022-06-25T05:29:48Z) - GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain
Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z) - Using multiple reference audios and style embedding constraints for
speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.