Towards Robust FastSpeech 2 by Modelling Residual Multimodality
- URL: http://arxiv.org/abs/2306.01442v1
- Date: Fri, 2 Jun 2023 11:03:26 GMT
- Title: Towards Robust FastSpeech 2 by Modelling Residual Multimodality
- Authors: Fabian Kögel, Bac Nguyen, Fabien Cardinaux
- Abstract summary: State-of-the-art non-autoregressive text-to-speech models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech.
We observe characteristic audio distortions in expressive speech datasets.
TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality in particular for expressive datasets.
- Score: 4.4904382374090765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art non-autoregressive text-to-speech (TTS) models based on
FastSpeech 2 can efficiently synthesise high-fidelity and natural speech. For
expressive speech datasets however, we observe characteristic audio
distortions. We demonstrate that such artefacts are introduced to the vocoder
reconstruction by over-smooth mel-spectrogram predictions, which are induced by
the choice of mean-squared-error (MSE) loss for training the mel-spectrogram
decoder. With MSE loss, FastSpeech 2 is limited to learning conditional averages of
the training distribution, which may not lie close to a natural sample if the
distribution remains multimodal even after conditioning on all available signals. To
alleviate this problem, we introduce TVC-GMM, a mixture model of
Trivariate-Chain Gaussian distributions, to model the residual multimodality.
TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality in
particular for expressive datasets as shown by both objective and subjective
evaluation.
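The core fix described in the abstract, replacing the MSE spectrogram loss with the likelihood of a mixture model, can be illustrated with a short PyTorch sketch. This is not the authors' TVC-GMM implementation: for brevity it treats every time-frequency bin as an independent univariate Gaussian mixture, whereas TVC-GMM additionally links each bin with its neighbours in time and frequency via trivariate-chain Gaussians. Names such as GMMSpectrogramHead and the hyper-parameter values are hypothetical.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMSpectrogramHead(nn.Module):
    """Predicts per-bin Gaussian mixture parameters instead of point estimates."""
    def __init__(self, hidden_dim: int, n_mels: int = 80, n_components: int = 5):
        super().__init__()
        self.n_mels = n_mels
        self.k = n_components
        # For every mel bin, predict K mixture logits, means and log-scales.
        self.proj = nn.Linear(hidden_dim, n_mels * n_components * 3)

    def forward(self, decoder_out):
        # decoder_out: (batch, frames, hidden_dim) from the FastSpeech 2 decoder
        b, t, _ = decoder_out.shape
        params = self.proj(decoder_out).view(b, t, self.n_mels, self.k, 3)
        logit_w, mean, log_scale = params.unbind(dim=-1)
        return logit_w, mean, log_scale

def gmm_nll_loss(logit_w, mean, log_scale, target):
    """Negative log-likelihood of the target mel-spectrogram under the mixture."""
    # target: (batch, frames, n_mels); broadcast against the K components.
    target = target.unsqueeze(-1)
    log_w = F.log_softmax(logit_w, dim=-1)
    comp = torch.distributions.Normal(mean, log_scale.exp())
    log_prob = torch.logsumexp(log_w + comp.log_prob(target), dim=-1)
    return -log_prob.mean()
```
At inference time one would sample from (or take the mode of) the predicted mixture per bin rather than emitting the conditional mean, which is what lets such a head avoid the over-smoothed spectrograms produced by a plain MSE regression.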
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach.
It suffers from the drawback of slow generation speed due to the requirement of a large number of time steps.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech significantly reduces inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols into the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.