Improving Prosody Modelling with Cross-Utterance BERT Embeddings for
End-to-end Speech Synthesis
- URL: http://arxiv.org/abs/2011.05161v1
- Date: Fri, 6 Nov 2020 10:03:11 GMT
- Title: Improving Prosody Modelling with Cross-Utterance BERT Embeddings for
End-to-end Speech Synthesis
- Authors: Guanghui Xu, Wei Song, Zhengchen Zhang, Chao Zhang, Xiaodong He, Bowen
Zhou
- Abstract summary: Cross-utterance (CU) context vectors are produced by an additional CU encoder based on the sentence embeddings extracted by a pre-trained BERT model.
It is also found that the prosody can be controlled indirectly by changing the neighbouring sentences.
- Score: 39.869097209615724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although prosody is related to linguistic information up to the
discourse structure, most text-to-speech (TTS) systems only take into account
the linguistic information within each sentence, which makes it challenging
to convert a paragraph of text into natural and expressive speech. In this
paper, we propose to use the text embeddings of the neighbouring sentences to
improve the prosody generation for each utterance of a paragraph in an
end-to-end fashion, without using any explicit prosody features. More
specifically, cross-utterance (CU) context vectors, which are produced by an
additional CU encoder based on the sentence embeddings extracted by a
pre-trained BERT model, are used to augment the input of the Tacotron2
decoder. Two types of BERT embeddings are investigated, which leads to the
use of different CU encoder structures. Experimental results on a Mandarin
audiobook dataset and the LJ-Speech English audiobook dataset demonstrate
that the use of CU information can improve the naturalness and expressiveness
of the synthesized speech. Subjective listening tests show that most
participants prefer the voice generated using the CU encoder over that
generated using standard Tacotron2. It is also found that the prosody can be
controlled indirectly by changing the neighbouring sentences.
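To make the CU conditioning mechanism concrete, the sketch below gives one plausible reading of the abstract: BERT sentence embeddings of the neighbouring sentences are summarised by a small CU encoder into a context vector that is concatenated to the Tacotron2 decoder input. The GRU-based encoder structure, the 128-dimensional context size, and the use of the [CLS] hidden state as the sentence embedding are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of cross-utterance (CU) conditioning, assuming a GRU-based
# CU encoder and [CLS]-vector sentence embeddings (both hypothetical choices).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class CUEncoder(nn.Module):
    """Maps a sequence of neighbouring-sentence BERT embeddings to one
    fixed-size cross-utterance context vector (illustrative structure)."""
    def __init__(self, bert_dim: int = 768, cu_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(bert_dim, cu_dim, batch_first=True)
        self.proj = nn.Linear(cu_dim, cu_dim)

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, n_neighbours, bert_dim)
        _, h = self.gru(sent_embs)           # h: (1, batch, cu_dim)
        return torch.tanh(self.proj(h[-1]))  # (batch, cu_dim)

@torch.no_grad()
def sentence_embeddings(sentences, tokenizer, bert):
    """One embedding per sentence; the [CLS] hidden state stands in for
    whichever BERT embedding type the paper actually uses."""
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    return bert(**batch).last_hidden_state[:, 0, :]  # (n_sentences, 768)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

# Neighbouring sentences of the utterance to be synthesised.
neighbours = [
    "It was a dark and stormy night.",
    "The door creaked open slowly.",
]
embs = sentence_embeddings(neighbours, tokenizer, bert).unsqueeze(0)

cu_encoder = CUEncoder()
cu_context = cu_encoder(embs)  # (1, 128)

# The Tacotron2 decoder consumes one 80-dim mel frame per step; here the CU
# context vector is simply concatenated to that decoder input.
mel_frame = torch.zeros(1, 80)
decoder_input = torch.cat([mel_frame, cu_context], dim=-1)  # (1, 208)
print(decoder_input.shape)
```

Because the decoder is conditioned on cu_context, swapping the neighbouring sentences at inference time changes the context vector and hence the generated prosody, which is consistent with the indirect prosody control reported in the abstract.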
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP [18.90593650641679]
A two-stage automatic annotation pipeline is proposed in this paper.
In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation pairs to enhance prosodic information in latent representations.
In the second stage, we build a multi-modal prosody annotator, comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier.
Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance, with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase, respectively.
arXiv Detail & Related papers (2023-09-11T12:50:28Z)
- Cross-Utterance Conditioned VAE for Speech Generation [27.5887600344053]
We present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation.
We propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing.
arXiv Detail & Related papers (2023-09-08T06:48:41Z)
- Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data [38.816953592085156]
We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems.
It enhances the ability to adapt one modality (i.e., source-language speech) to another (i.e., source-language text).
arXiv Detail & Related papers (2022-12-04T09:27:56Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems [31.18865184576272]
This work aligns speech embeddings and BERT embeddings in a much more efficient and fine-grained manner, on a token-by-token basis.
We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder.
Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets.
arXiv Detail & Related papers (2022-04-11T15:24:25Z)
- Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data [20.132799566988826]
We propose to combine a fine-tuned BERT-based front-end with a pre-trained FastSpeech2-based acoustic model to improve prosody modeling.
Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech2 can improve prosody, especially for structurally complex sentences.
arXiv Detail & Related papers (2021-11-15T05:58:29Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.