Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
- URL: http://arxiv.org/abs/2406.00654v1
- Date: Sun, 2 Jun 2024 07:54:33 GMT
- Title: Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
- Authors: Chen Chen, Yuchen Hu, Wen Wu, Helin Wang, Eng Siong Chng, Chao Zhang
- Abstract summary: We propose uncertainty-aware optimization (UNO), a sampling-annotating-learning framework tailored to text-to-speech (TTS) optimization.
We show that UNO considerably improves the zero-shot performance of TTS models in terms of MOS, word error rate, and speaker similarity.
We also show that UNO can adapt to a desired speaking style in emotional TTS seamlessly and flexibly.
- Score: 39.54647336161013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers. However, although human subjective evaluation, such as the mean opinion score (MOS), remains the gold standard for assessing the quality of synthetic speech, even state-of-the-art TTS approaches have kept human feedback isolated from training, resulting in mismatched training objectives and evaluation metrics. In this work, we investigate the novel topic of integrating subjective human evaluation into the TTS training loop. Inspired by the recent success of reinforcement learning from human feedback, we propose a comprehensive sampling-annotating-learning framework tailored to TTS optimization, namely uncertainty-aware optimization (UNO). Specifically, UNO eliminates the need for a reward model or preference data by directly maximizing the utility of speech generations while accounting for the uncertainty that arises from the inherent variability in subjective human speech perception and evaluation. Experimental results from both subjective and objective evaluations demonstrate that UNO considerably improves the zero-shot performance of TTS models in terms of MOS, word error rate, and speaker similarity. Additionally, we show that UNO can seamlessly and flexibly adapt to a desired speaking style in emotional TTS.
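The abstract describes the recipe only at a high level. As a rough, non-authoritative sketch of what an uncertainty-aware, utility-weighted objective could look like, the snippet below weights each sampled generation's log-probability by its annotated utility and rater agreement; every identifier here (mos_scores, mos_std, the confidence weighting, the KL coefficient beta) is a hypothetical stand-in, not the paper's actual formulation.

```python
import torch

def uno_style_loss(logprobs, ref_logprobs, mos_scores, mos_std, beta=0.1):
    """Utility-weighted objective in the spirit of UNO's sampling-annotating-
    learning loop. Generations rated above the batch mean are reinforced;
    noisy annotations (high std across raters) are down-weighted; a KL-style
    term keeps the fine-tuned TTS model near its reference model.
    All shapes and names are illustrative assumptions.
    """
    confidence = 1.0 / (1.0 + mos_std)             # uncertainty weighting
    utility = (mos_scores - mos_scores.mean()) * confidence
    kl_term = beta * (logprobs - ref_logprobs)     # stay near the reference
    return -(utility * logprobs - kl_term).mean()  # maximize -> minimize
```

Here logprobs would be the sequence log-probabilities of the sampled utterances under the current model, and ref_logprobs the same quantities under the frozen pre-trained model.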
Related papers
- Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation [12.954531089716008]
The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously.
We show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems.
We propose two refined variants of the MUSHRA test.
arXiv Detail & Related papers (2024-11-19T18:37:45Z)
- Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing a quality estimation (QE) model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores (a sketch of this reward shaping appears after this entry).
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
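As a minimal sketch of the rule-penalized QE reward described in the entry above: both the rules and all function names below are hypothetical illustrations, not the paper's implementation.

```python
import torch

def rule_violations(sources, hypotheses):
    """Toy rules flagging obviously incorrect translations: empty output,
    a verbatim copy of the source, or a suspiciously short hypothesis."""
    flags = []
    for src, hyp in zip(sources, hypotheses):
        empty = not hyp.strip()
        copied = hyp.strip() == src.strip()
        too_short = len(hyp.split()) < 0.3 * max(len(src.split()), 1)
        flags.append(empty or copied or too_short)
    return torch.tensor(flags)

def penalized_qe_reward(qe_scores, violations, penalty=1.0):
    """Subtract a fixed penalty from the QE score of flagged translations,
    counteracting the overoptimization failure mode described above."""
    return qe_scores - penalty * violations.float()
```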
- Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech [34.8899247119748]
We propose an evaluation technique that trains an ASR model on synthetic speech and assesses its performance on real speech.
Compared with SpeechLMScore and MOSNet, our proposed metric demonstrates a strong correlation with both MOS naturalness and MOS intelligibility (the pipeline is sketched after this entry).
arXiv Detail & Related papers (2023-10-01T15:52:48Z)
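A minimal sketch of the ASR-on-synthetic-speech metric from the entry above, assuming hypothetical tts_model, asr_trainer, and asr_model interfaces; only jiwer's wer function is a real API.

```python
from jiwer import wer  # pip install jiwer

def synth_asr_eval(tts_model, asr_trainer, train_texts, real_test_set):
    """1) Synthesize a training corpus with the TTS system under test.
    2) Train an ASR model on that synthetic corpus.
    3) Report its WER on held-out *real* speech: lower WER suggests the
       synthetic speech better matches real speech. Interfaces are assumed.
    """
    synthetic_corpus = [(tts_model.synthesize(t), t) for t in train_texts]
    asr_model = asr_trainer.train(synthetic_corpus)
    hypotheses = [asr_model.transcribe(audio) for audio, _ in real_test_set]
    references = [text for _, text in real_test_set]
    return wer(references, hypotheses)
```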
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation (a toy VAE sketch follows this entry).
Evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves a CMOS of -0.01 relative to human recordings at the sentence level.
arXiv Detail & Related papers (2022-05-09T16:57:35Z)
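A toy conditional-VAE objective in the spirit of NaturalSpeech's text-to-waveform design; the real system adds phoneme encoders, normalizing flows, and adversarial training, so everything below is an illustrative assumption.

```python
import torch
import torch.nn as nn

class TinyTextToSpeechVAE(nn.Module):
    """Encode speech frames into a latent posterior q(z|x), decode them
    conditioned on text, and regularize toward a text-conditioned prior
    p(z|text). Dimensions and layers are deliberately minimal."""
    def __init__(self, text_dim=128, latent_dim=16, frame_dim=80):
        super().__init__()
        self.posterior = nn.Linear(frame_dim, 2 * latent_dim)
        self.prior = nn.Linear(text_dim, 2 * latent_dim)
        self.decoder = nn.Linear(latent_dim + text_dim, frame_dim)

    def forward(self, frames, text_emb):
        mu_q, logvar_q = self.posterior(frames).chunk(2, dim=-1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        recon = self.decoder(torch.cat([z, text_emb], dim=-1))
        mu_p, logvar_p = self.prior(text_emb).chunk(2, dim=-1)
        # KL(q(z|x) || p(z|text)) for diagonal Gaussians.
        kl = 0.5 * (logvar_p - logvar_q - 1.0
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2)
                    / logvar_p.exp()).sum(-1).mean()
        return ((recon - frames) ** 2).mean() + kl
```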
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation (one plausible variant is sketched after this entry).
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
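A sketch of listener-dependent MOS prediction in the spirit of LDNet: the model conditions on a listener embedding during training, and inference averages predictions over all known listeners for stability. The architecture and dimensions are assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class ListenerDependentMOS(nn.Module):
    """MOS predictor that sees both speech features and a listener
    embedding, so it can model individual rater biases."""
    def __init__(self, feat_dim=80, n_listeners=300, emb_dim=32):
        super().__init__()
        self.listener_emb = nn.Embedding(n_listeners, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats, listener_id):
        emb = self.listener_emb(listener_id)
        return self.net(torch.cat([feats, emb], dim=-1)).squeeze(-1)

    @torch.no_grad()
    def predict_mos(self, feats):
        # Stable inference: average the prediction over every listener ID.
        n = self.listener_emb.num_embeddings
        preds = [self(feats, torch.full(feats.shape[:1], i)) for i in range(n)]
        return torch.stack(preds).mean(0)
```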
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to better understand why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture in which the prediction of phoneme-level prosody features is conditioned on word-level prosody features (a sketch follows this entry).
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
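A minimal sketch of the hierarchical conditioning named above: word-level prosody is predicted first, then broadcast to phonemes and fed into the phoneme-level predictor. The word-to-phoneme index map and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalProsody(nn.Module):
    """Phoneme-level prosody predictions conditioned on word-level ones."""
    def __init__(self, hidden=256, prosody_dim=3):  # e.g., F0/energy/duration
        super().__init__()
        self.word_head = nn.Linear(hidden, prosody_dim)
        self.phone_head = nn.Linear(hidden + prosody_dim, prosody_dim)

    def forward(self, word_hidden, phone_hidden, word2phone):
        # word_hidden: (B, W, H); phone_hidden: (B, T, H)
        # word2phone: (B, T) index of the word each phoneme belongs to.
        word_prosody = self.word_head(word_hidden)              # (B, W, P)
        idx = word2phone.unsqueeze(-1).expand(-1, -1, word_prosody.size(-1))
        expanded = torch.gather(word_prosody, 1, idx)           # (B, T, P)
        phone_prosody = self.phone_head(
            torch.cat([phone_hidden, expanded], dim=-1))        # (B, T, P)
        return word_prosody, phone_prosody
```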
- Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech [15.796199345773873]
We propose a novel method to improve speech quality by training a TTS model under the supervision of a perceptual loss.
We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech.
The proposed method can be applied regardless of the TTS model architecture or the cause of speech quality degradation (a sketch of the objective follows this entry).
arXiv Detail & Related papers (2020-11-02T18:13:48Z)
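A minimal sketch of the MOS-supervised objective described above, under the assumption that synthesis is differentiable end-to-end; the model interfaces are hypothetical.

```python
def mos_perceptual_loss(tts_model, mos_predictor, text_batch):
    """Freeze a pre-trained MOS predictor and train the TTS model to
    maximize the predicted MOS of its own synthesized speech."""
    for p in mos_predictor.parameters():
        p.requires_grad_(False)            # keep the MOS model fixed
    speech = tts_model(text_batch)         # differentiable synthesis
    predicted_mos = mos_predictor(speech)  # higher is better
    return -predicted_mos.mean()           # maximize MOS = minimize -MOS
```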
This list is automatically generated from the titles and abstracts of the papers on this site.