Evaluating and reducing the distance between synthetic and real speech distributions
- URL: http://arxiv.org/abs/2211.16049v2
- Date: Thu, 25 May 2023 08:41:57 GMT
- Title: Evaluating and reducing the distance between synthetic and real speech distributions
- Authors: Christoph Minixhofer, Ondřej Klejch, Peter Bell
- Abstract summary: Modern Text-to-Speech systems can produce natural-sounding speech, but are unable to reproduce the full diversity found in natural speech data.
We quantify the distance between real and synthetic speech via a range of utterance-level statistics.
Our best system achieves a 10% reduction in distribution distance.
- Score: 8.908425534666353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While modern Text-to-Speech (TTS) systems can produce natural-sounding
speech, they remain unable to reproduce the full diversity found in natural
speech data. We consider the distribution of all possible real speech samples
that could be generated by these speakers alongside the distribution of all
synthetic samples that could be generated for the same set of speakers, using a
particular TTS system. We set out to quantify the distance between real and
synthetic speech via a range of utterance-level statistics related to
properties of the speaker, speech prosody and acoustic environment. Differences
in the distribution of these statistics are evaluated using the Wasserstein
distance. We reduce these distances by providing ground-truth values at
generation time, and quantify the improvements to the overall distribution
distance, approximated using an automatic speech recognition system. Our best
system achieves a 10% reduction in distribution distance.
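To make the evaluation idea concrete, here is a minimal sketch of comparing one utterance-level statistic between real and synthetic speech with the one-dimensional Wasserstein distance; the statistic and the data are placeholders rather than the paper's actual features, and scipy's `wasserstein_distance` is assumed as the distance implementation.

```python
# Minimal sketch (not the authors' code): distance between the distributions of
# one utterance-level statistic for real vs. synthetic speech.
import numpy as np
from scipy.stats import wasserstein_distance

def utterance_statistic(waveform: np.ndarray) -> float:
    """Placeholder statistic per utterance, e.g. log mean energy."""
    return float(np.log(np.mean(waveform ** 2) + 1e-9))

rng = np.random.default_rng(0)
real_utts = [rng.normal(0.0, 1.0, 16000) for _ in range(100)]    # hypothetical real utterances
synth_utts = [rng.normal(0.0, 0.8, 16000) for _ in range(100)]   # hypothetical synthetic utterances

real_stats = [utterance_statistic(u) for u in real_utts]
synth_stats = [utterance_statistic(u) for u in synth_utts]

# 1-D Wasserstein distance between the two empirical distributions of the statistic.
print("Wasserstein distance:", wasserstein_distance(real_stats, synth_stats))
```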
Related papers
- Sample-Efficient Diffusion for Text-To-Speech Synthesis [31.372486998377966]
It is based on a novel diffusion architecture that we call the U-Audio Transformer (U-AT).
SESD achieves impressive results despite training on less than 1k hours of speech.
It synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% of the training data.
arXiv Detail & Related papers (2024-09-01T20:34:36Z)
- TTSDS -- Text-to-Speech Distribution Score [9.380879437204277]
Many recently published Text-to-Speech (TTS) systems produce audio close to real speech.
We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility.
We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score computed as an unweighted average of factors strongly correlates with the human evaluations.
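As a sketch of how such a score might be aggregated (this is not the TTSDS implementation; factor names and values are hypothetical):

```python
# Illustrative only: a benchmark score formed as an unweighted average of factor
# scores; the factor names and values are hypothetical, not TTSDS outputs.
factor_scores = {"prosody": 0.72, "speaker_identity": 0.81, "intelligibility": 0.90}
score = sum(factor_scores.values()) / len(factor_scores)
print(f"unweighted-average score: {score:.3f}")
```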
arXiv Detail & Related papers (2024-07-17T16:30:27Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model, trained on a large amount of data, to obtain embedding vectors from speech.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different reference speech.
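A minimal sketch of the general idea of extracting embedding vectors directly from an SSL speech model, assuming torchaudio's pretrained WAV2VEC2_BASE pipeline and simple mean pooling; the paper's actual SSL model, layers, and pooling may differ.

```python
# Minimal sketch, assuming torchaudio's pretrained WAV2VEC2_BASE pipeline as the SSL model.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

# Hypothetical one-second utterance at the model's expected sample rate.
waveform = torch.randn(1, int(bundle.sample_rate))

with torch.inference_mode():
    features, _ = model.extract_features(waveform)  # list of per-layer frame features
    embedding = features[-1].mean(dim=1)            # mean-pool last layer over time

print(embedding.shape)  # e.g. torch.Size([1, 768])
```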
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis robustness, and voice quality in a zero-shot setting.
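A toy sketch of residual vector quantization, the codec-style quantization that produces the quantized latent vectors mentioned above; random codebooks stand in for a trained codec, so this illustrates the mechanism rather than the paper's model.

```python
# Toy residual vector quantization (RVQ) with random codebooks; illustration only.
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_stages = 8, 16, 4
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_stages)]

def rvq_encode(latent, codebooks):
    """One codeword index per stage; each stage quantizes the remaining residual."""
    residual, indices = latent.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Quantized latent = sum of the selected codewords."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

latent = rng.normal(size=dim)
codes = rvq_encode(latent, codebooks)
print("codes:", codes, "error:", np.linalg.norm(latent - rvq_decode(codes, codebooks)))
```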
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
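A hedged sketch of one common way such self-supervised discrete speech units are obtained: clustering frame-level SSL features with k-means and replacing each frame by its cluster index. The features below are random stand-ins, and the paper's exact unit extraction may differ.

```python
# Sketch: discretize frame-level SSL features into units with k-means; random
# features stand in for real SSL outputs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 768))              # pretend SSL features: (frames, dim)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)
units = kmeans.predict(frames)                    # one discrete unit per frame

# Collapse consecutive repeats, a common step before treating units as a token sequence.
deduped = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(deduped[:20])
```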
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech [27.84124625934247]
A cross-utterance conditioned VAE (CUC-VAE) is proposed to estimate a posterior probability distribution of the latent prosody features for each phoneme.
CUC-VAE allows sampling from an utterance-specific prior distribution conditioned on cross-utterance information.
Experimental results on LJ-Speech and LibriTTS data show that the proposed CUC-VAE TTS system improves naturalness and prosody diversity with clear margins.
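A minimal sketch of the sampling idea: a hypothetical prior network maps a cross-utterance context vector to the mean and variance of a per-phoneme latent, from which a prosody latent is drawn via the reparameterization trick. Modules and shapes are illustrative, not the paper's architecture.

```python
# Hypothetical conditional prior: cross-utterance context -> per-phoneme latent sample.
import torch
import torch.nn as nn

class ConditionalPrior(nn.Module):
    def __init__(self, context_dim=256, latent_dim=16):
        super().__init__()
        self.net = nn.Linear(context_dim, 2 * latent_dim)

    def forward(self, context):
        mu, logvar = self.net(context).chunk(2, dim=-1)
        # Reparameterized sample from N(mu, sigma^2).
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

prior = ConditionalPrior()
cross_utt_context = torch.randn(1, 50, 256)   # (batch, phonemes, context dim), illustrative
z = prior(cross_utt_context)                  # one latent prosody vector per phoneme
print(z.shape)                                # torch.Size([1, 50, 16])
```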
arXiv Detail & Related papers (2022-05-09T08:39:53Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.
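As a back-of-the-envelope illustration of how such a low bitrate adds up, the sketch below counts bits for a discrete content stream and a quantized pitch stream; the rates and codebook sizes are assumptions for illustration, not the configuration behind the paper's 365 bits-per-second figure.

```python
# Illustrative bit counting with assumed rates and codebook sizes (not the paper's exact setup).
import math

content_bps = 50 * math.log2(100)     # content units: 50 per second, 100-entry codebook (assumed)
f0_bps = 12.5 * math.log2(32)         # quantized pitch: 12.5 per second, 32-entry codebook (assumed)
speaker_bps = 0                       # one per-utterance speaker vector contributes ~0 bits/second

print(f"content ≈ {content_bps:.0f} bps, f0 ≈ {f0_bps:.0f} bps, total ≈ {content_bps + f0_bps:.0f} bps")
```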
arXiv Detail & Related papers (2021-04-01T09:20:33Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)