Comparison of Speech Representations for Automatic Quality Estimation in
Multi-Speaker Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2002.12645v2
- Date: Mon, 27 Apr 2020 09:25:25 GMT
- Title: Comparison of Speech Representations for Automatic Quality Estimation in
Multi-Speaker Text-to-Speech Synthesis
- Authors: Jennifer Williams, Joanna Rownicka, Pilar Oplustil, Simon King
- Abstract summary: We aim to characterize how different speakers contribute to the perceived output quality of multi-speaker Text-to-Speech synthesis.
We automatically rate the quality of TTS using a neural network (NN) trained on human mean opinion score (MOS) ratings.
- Score: 21.904558308567122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to characterize how different speakers contribute to the perceived
output quality of multi-speaker Text-to-Speech (TTS) synthesis. We
automatically rate the quality of TTS using a neural network (NN) trained on
human mean opinion score (MOS) ratings. First, we train and evaluate our NN
model on 13 different TTS and voice conversion (VC) systems from the ASVspoof
2019 Logical Access (LA) Dataset. Since it is not known how best to represent
speech for this task, we compare 8 different representations alongside MOSNet
frame-based features. Our representations include image-based spectrogram
features and x-vector embeddings that explicitly model different types of noise
such as T60 reverberation time. Our NN predicts MOS with a high correlation to
human judgments. We report prediction correlation and error. A key finding is
that the quality achieved for certain speakers seems consistent, regardless of
the TTS or VC system. It is widely accepted that some speakers yield higher
quality than others when building a TTS system: our method provides an automatic way to
identify such speakers. Finally, to see if our quality prediction models
generalize, we predict quality scores for synthetic speech using a separate
multi-speaker TTS system that was trained on LibriTTS data, and conduct our own
MOS listening test to compare human ratings with our NN predictions.
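As a rough illustration of the prediction setup described in the abstract (not the authors' implementation), the sketch below regresses MOS from utterance-level speech embeddings such as x-vectors using a small feed-forward network, and reports prediction correlation and error. The embedding dimension, layer sizes, and training settings are illustrative assumptions.
```python
# Minimal sketch: predict MOS from fixed-dimensional speech embeddings and
# evaluate by correlation with human ratings. All hyperparameters are
# illustrative assumptions, not values from the paper.
import numpy as np
import torch
import torch.nn as nn
from scipy.stats import pearsonr, spearmanr

class MOSRegressor(nn.Module):
    """Map one utterance-level embedding (e.g., a 512-d x-vector) to a scalar MOS."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def train_and_evaluate(train_x, train_mos, test_x, test_mos, epochs=50):
    model = MOSRegressor(dim=train_x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    x = torch.tensor(train_x, dtype=torch.float32)
    y = torch.tensor(train_mos, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)   # regress toward human MOS labels
        loss.backward()
        opt.step()
    with torch.no_grad():
        pred = model(torch.tensor(test_x, dtype=torch.float32)).numpy()
    return {
        "pearson": pearsonr(pred, test_mos)[0],    # linear correlation with human MOS
        "spearman": spearmanr(pred, test_mos)[0],  # rank correlation
        "mse": float(np.mean((pred - test_mos) ** 2)),
    }

if __name__ == "__main__":
    # Synthetic stand-in data; in practice the embeddings would come from an
    # x-vector extractor or a spectrogram encoder, and MOS from listening tests.
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(200, 512)).astype(np.float32)
    mos = np.clip(3.0 + 0.5 * emb[:, 0] + rng.normal(scale=0.3, size=200), 1.0, 5.0)
    print(train_and_evaluate(emb[:150], mos[:150], emb[150:], mos[150:]))
```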
Related papers
- TTSDS -- Text-to-Speech Distribution Score [9.380879437204277]
Many recently published Text-to-Speech (TTS) systems produce audio close to real speech.
We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility.
We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score, computed as an unweighted average of the factors, strongly correlates with human evaluations.
arXiv Detail & Related papers (2024-07-17T16:30:27Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E demonstrates in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation.
Experiment evaluations on popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS to human recordings at the sentence level.
arXiv Detail & Related papers (2022-05-09T16:57:35Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)