Using previous acoustic context to improve Text-to-Speech synthesis
- URL: http://arxiv.org/abs/2012.03763v1
- Date: Mon, 7 Dec 2020 15:00:18 GMT
- Title: Using previous acoustic context to improve Text-to-Speech synthesis
- Authors: Pilar Oplustil-Gallegos and Simon King
- Abstract summary: We leverage the sequential nature of the data using an acoustic context encoder that produces an embedding of the previous utterance audio.
We compare two secondary tasks: predicting the ordering of utterance pairs, and predicting the embedding of the current utterance audio.
- Score: 30.885417054452905
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Many speech synthesis datasets, especially those derived from audiobooks,
naturally comprise sequences of utterances. Nevertheless, such data are
commonly treated as individual, unordered utterances both when training a model
and at inference time. This discards important prosodic phenomena above the
utterance level. In this paper, we leverage the sequential nature of the data
using an acoustic context encoder that produces an embedding of the previous
utterance audio. This is input to the decoder in a Tacotron 2 model. The
embedding is also used for a secondary task, providing additional supervision.
We compare two secondary tasks: predicting the ordering of utterance pairs, and
predicting the embedding of the current utterance audio. Results show that the
relation between consecutive utterances is informative: our proposed model
significantly improves naturalness over a Tacotron 2 baseline.
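To make the described architecture concrete, below is a minimal sketch (not the authors' released code) of the idea in the abstract: an acoustic context encoder that embeds the previous utterance's audio, whose output would be fed to the Tacotron 2 decoder, plus the two secondary-task heads the paper compares (pair-ordering prediction and current-utterance embedding prediction). All module names, layer choices, and dimensions (e.g. AcousticContextEncoder, hidden_dim, emb_dim) are illustrative assumptions, not taken from the paper.

```python
# Sketch of the paper's idea under stated assumptions; PyTorch is used for brevity.
import torch
import torch.nn as nn


class AcousticContextEncoder(nn.Module):
    """Embeds the PREVIOUS utterance's mel-spectrogram into a fixed-size vector."""

    def __init__(self, n_mels: int = 80, hidden_dim: int = 128, emb_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, emb_dim)

    def forward(self, prev_mel: torch.Tensor) -> torch.Tensor:
        # prev_mel: (batch, frames, n_mels) -> embedding: (batch, emb_dim)
        _, h = self.rnn(prev_mel)            # h: (2, batch, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1)  # concatenate forward/backward states
        return torch.tanh(self.proj(h))


class SecondaryHeads(nn.Module):
    """The two secondary tasks compared in the paper, sketched as simple heads:
    (a) predict whether an utterance pair is in the correct order,
    (b) predict the embedding of the CURRENT utterance audio."""

    def __init__(self, emb_dim: int = 64):
        super().__init__()
        self.order_clf = nn.Linear(2 * emb_dim, 1)       # pair-ordering logit
        self.current_pred = nn.Linear(emb_dim, emb_dim)  # current-embedding regressor

    def ordering_loss(self, prev_emb, cur_emb, label):
        logit = self.order_clf(torch.cat([prev_emb, cur_emb], dim=-1)).squeeze(-1)
        return nn.functional.binary_cross_entropy_with_logits(logit, label)

    def current_embedding_loss(self, prev_emb, cur_emb_target):
        return nn.functional.mse_loss(self.current_pred(prev_emb), cur_emb_target)


if __name__ == "__main__":
    enc = AcousticContextEncoder()
    heads = SecondaryHeads()
    prev_mel = torch.randn(4, 200, 80)  # previous utterance mels (dummy data)
    cur_mel = torch.randn(4, 180, 80)   # current utterance mels (dummy data)
    prev_emb, cur_emb = enc(prev_mel), enc(cur_mel)
    loss_order = heads.ordering_loss(prev_emb, cur_emb, torch.ones(4))
    loss_embed = heads.current_embedding_loss(prev_emb, cur_emb.detach())
    # In the full model, prev_emb would be concatenated with the Tacotron 2
    # decoder input at each frame (the Tacotron 2 model itself is not shown).
    print(prev_emb.shape, loss_order.item(), loss_embed.item())
```

The secondary losses would be added to the usual Tacotron 2 reconstruction loss during training; the relative weighting is a training detail not specified in this summary.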
Related papers
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
However, they often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech significantly improves inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- WavThruVec: Latent speech representation as intermediate features for neural speech synthesis [1.1470070927586016]
WavThruVec is a two-stage architecture that resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as intermediate speech representation.
We show that the proposed model not only matches the quality of state-of-the-art neural models, but also presents useful properties enabling tasks like voice conversion or zero-shot synthesis.
arXiv Detail & Related papers (2022-03-31T10:21:08Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data [20.132799566988826]
We propose to combine a fine-tuned BERT-based front-end with a pre-trained FastSpeech2-based acoustic model to improve prosody modeling.
Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech 2 can improve prosody, especially for structurally complex sentences.
arXiv Detail & Related papers (2021-11-15T05:58:29Z)
- Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
arXiv Detail & Related papers (2021-06-15T18:03:48Z)
- Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features [1.6286844497313562]
We propose a strategy for conditioning Tacotron-2 on two fundamental prosodic features in English -- stress syllable and pitch accent.
We show that jointly conditioned features at pre-encoder and intra-decoder stages result in prosodically natural synthesized speech.
arXiv Detail & Related papers (2021-04-08T20:50:15Z)
- Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning [60.20205278845412]
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent that decides when enough input has been read to start synthesising audio.
arXiv Detail & Related papers (2020-08-07T11:48:05Z)
- End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)