Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech
- URL: http://arxiv.org/abs/2310.00706v1
- Date: Sun, 1 Oct 2023 15:52:48 GMT
- Title: Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech
- Authors: Dareen Alharthi, Roshan Sharma, Hira Dhamyal, Soumi Maiti, Bhiksha Raj, Rita Singh
- Abstract summary: We propose an evaluation technique involving the training of an ASR model on synthetic speech and assessing its performance on real speech.
Our proposed metric demonstrates a strong correlation with both MOS naturalness and MOS intelligibility when compared to SpeechLMScore and MOSNet.
- Score: 34.8899247119748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern speech synthesis systems have improved significantly, with synthetic
speech being indistinguishable from real speech. However, efficient and
holistic evaluation of synthetic speech still remains a significant challenge.
Human evaluation using Mean Opinion Score (MOS) is ideal, but inefficient due
to high costs. Therefore, researchers have developed auxiliary automatic
metrics like Word Error Rate (WER) to measure intelligibility. Prior works
focus on evaluating synthetic speech based on pre-trained speech recognition
models; however, this can be limiting since this approach primarily measures
speech intelligibility. In this paper, we propose an evaluation technique
involving the training of an ASR model on synthetic speech and assessing its
performance on real speech. Our main assumption is that by training the ASR
model on the synthetic speech, the WER on real speech reflects the similarity
between distributions, a broader assessment of synthetic speech quality beyond
intelligibility. Our proposed metric demonstrates a strong correlation with
both MOS naturalness and MOS intelligibility when compared to SpeechLMScore and
MOSNet on three recent Text-to-Speech (TTS) systems: MQTTS, StyleTTS, and
YourTTS.
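The evaluation idea described in the abstract (train an ASR model on synthetic speech, then measure WER on real speech) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `transcribe` callable stands in for an ASR model already trained on one TTS system's output, and `score_tts_system`, `corpus_wer`, and the lowercase whitespace tokenization are assumptions made for the example.

```python
from typing import Callable, Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        # Assumption: simple lowercase whitespace tokenization.
        ref_tokens = reference.lower().split()
        hyp_tokens = hypothesis.lower().split()
        errors += edit_distance(ref_tokens, hyp_tokens)
        words += len(ref_tokens)
    if words == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / words


def score_tts_system(
    transcribe: Callable[[object], str],
    real_test_set: Iterable[Tuple[object, str]],
) -> float:
    """WER on real speech of an ASR model trained on one TTS system's output.

    `transcribe` is a hypothetical stand-in for that trained ASR model;
    `real_test_set` yields (real_audio, reference_transcript) pairs.
    A lower value suggests the synthetic training speech is closer in
    distribution to real speech.
    """
    pairs = [(reference, transcribe(audio)) for audio, reference in real_test_set]
    return corpus_wer(pairs)
```

Under these assumptions, each TTS system would get its own ASR model and its own `score_tts_system` value, and those values could then be compared against MOS ratings for the same systems.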
Related papers
- DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization [12.310318928818546]
We propose a novel method of distilling TTS diffusion models with direct end-to-end evaluation metric optimization.
We show DMDSpeech consistently surpasses prior state-of-the-art models in both naturalness and speaker similarity.
This work highlights the potential of direct metric optimization in speech synthesis, allowing models to better align with human auditory preferences.
arXiv Detail & Related papers (2024-10-14T21:17:58Z)
- Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models [24.943609458024596]
We propose a novel approach to significantly improve the intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task.
Unlike conventional methods that explicitly record ground-truth speech, our methodology relies on self-supervision and speech-to-speech synthesis.
Our method surpasses the current state-of-the-art (SOTA) by 29.08% improvement in the Mel-Cepstral Distortion (MCD) metric.
arXiv Detail & Related papers (2024-07-26T06:44:01Z)
- Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning [3.5032870024762386]
This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech.
The approach involved finetuning a multi-speaker TTS model to work with child speech.
We conducted an objective assessment that showed a significant correlation between real and synthetic child voices.
arXiv Detail & Related papers (2023-11-07T19:31:44Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
A recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this recent Text-to-Speech architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis [3.2548794659022398]
Speech synthesis has come a long way as current text-to-speech (TTS) models can now generate natural human-sounding speech.
This study developed and validated a training pipeline for fine-tuning state-of-the-art neural TTS models using child speech datasets.
arXiv Detail & Related papers (2022-03-22T09:34:21Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and its emotion human-labeled annotation.
Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Speech Synthesis as Augmentation for Low-Resource ASR [7.2244067948447075]
Speech synthesis might hold the key to low-resource speech recognition.
Data augmentation techniques have become an essential part of modern speech recognition training.
Speech synthesis techniques have been rapidly getting closer to the goal of achieving human-like speech.
arXiv Detail & Related papers (2020-12-23T22:19:42Z)