Using multiple reference audios and style embedding constraints for speech synthesis
- URL: http://arxiv.org/abs/2110.04451v1
- Date: Sat, 9 Oct 2021 04:24:29 GMT
- Title: Using multiple reference audios and style embedding constraints for speech synthesis
- Authors: Cheng Gong, Longbiao Wang, Zhenhua Ling, Ju Zhang, Jianwu Dang
- Abstract summary: The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
- Score: 68.62945852651383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The end-to-end speech synthesis model can directly take an utterance as
reference audio, and generate speech from the text with prosody and speaker
characteristics similar to the reference audio. However, an appropriate
acoustic embedding must be manually selected during inference. Because only
matched text and speech are used during training, inference with unmatched text
and speech causes the model to synthesize speech with low content quality. In
this study, we propose to mitigate these
two problems by using multiple reference audios and style embedding constraints
rather than using only the target audio. Multiple reference audios are
automatically selected using the sentence similarity determined by
Bidirectional Encoder Representations from Transformers (BERT). In addition, we
use the "target" style embedding from a pre-trained encoder as a constraint by
considering the mutual information between the predicted and "target" style
embedding. The experimental results show that the proposed model can improve
the speech naturalness and content quality with multiple reference audios and
can also outperform the baseline model in ABX preference tests of style
similarity.
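As a rough illustration of the reference-selection step described in the abstract, the sketch below picks the reference utterances whose transcripts are most similar to the input text using BERT sentence embeddings and cosine similarity. The checkpoint name (bert-base-uncased), the mean-pooling strategy, and the number of references k are assumptions for illustration only; the abstract does not specify them.

```python
# Illustrative sketch, not the authors' code: pick reference audios whose
# transcripts are closest to the input text under BERT sentence similarity.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def embed(sentences):
    """Mean-pooled BERT embeddings for a list of sentences."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state           # (B, T, H)
    mask = enc["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, H)

def select_references(input_text, corpus_transcripts, k=3):
    """Indices of the k transcripts most similar to input_text; their paired
    audio clips then serve as the multiple reference audios for synthesis."""
    query = embed([input_text])                          # (1, H)
    candidates = embed(corpus_transcripts)               # (N, H)
    sims = torch.nn.functional.cosine_similarity(query, candidates)  # (N,)
    return sims.topk(k).indices.tolist()
```

The abstract also mentions constraining the mutual information between the predicted style embedding and a "target" embedding from a pre-trained encoder. The exact estimator is not given in the abstract, so the following uses a generic InfoNCE-style contrastive bound purely as a stand-in; the temperature and normalization are likewise assumptions.

```python
import torch
import torch.nn.functional as F

def style_mi_loss(pred_style, target_style, temperature=0.1):
    """Contrastive (InfoNCE-like) loss over a batch of (B, D) style embeddings;
    minimizing it pushes predicted embeddings toward their matched targets."""
    pred = F.normalize(pred_style, dim=-1)
    tgt = F.normalize(target_style, dim=-1)
    logits = pred @ tgt.t() / temperature                # (B, B) similarities
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)               # matched pairs on diagonal
```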
Related papers
- TokenSplit: Using Discrete Speech Representations for Direct, Refined,
and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z) - CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained
Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z) - Zero-shot text-to-speech synthesis conditioned using self-supervised
speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different speech samples.
arXiv Detail & Related papers (2023-04-24T10:15:58Z) - CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled
Videos [44.14061539284888]
We propose to approach text-queried universal sound separation by using only unlabeled data.
The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model.
While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting.
arXiv Detail & Related papers (2022-12-14T07:21:45Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.