A Text-to-Speech Pipeline, Evaluation Methodology, and Initial
Fine-Tuning Results for Child Speech Synthesis
- URL: http://arxiv.org/abs/2203.11562v1
- Date: Tue, 22 Mar 2022 09:34:21 GMT
- Authors: Rishabh Jain and Mariam Yiwere and Dan Bigioi and Peter Corcoran and
Horia Cucu
- Score: 3.2548794659022398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech synthesis has come a long way as current text-to-speech (TTS) models
can now generate natural human-sounding speech. However, most of the TTS
research focuses on using adult speech data and there has been very limited
work done on child speech synthesis. This study developed and validated a
training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models
using child speech datasets. This approach adopts a multispeaker TTS retuning
workflow to provide a transfer-learning pipeline. A publicly available child
speech dataset was cleaned to provide a smaller subset of approximately 19
hours, which formed the basis of our fine-tuning experiments. Both subjective
and objective evaluations were performed using a pretrained MOSNet for
objective evaluation and a novel subjective framework for mean opinion score
(MOS) evaluations. Subjective evaluations achieved a MOS of 3.92 for speech
intelligibility, 3.85 for voice naturalness, and 3.96 for voice consistency.
Objective evaluation using a pretrained MOSNet showed a strong correlation
between real and synthetic child voices. The final trained model was able to
synthesize child-like speech from reference audio samples as short as 5
seconds.
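The subjective scores reported above follow the standard MOS convention: each utterance is rated on a 1-5 scale by several listeners, and the MOS for a dimension (intelligibility, naturalness, consistency) is the mean over all ratings. A minimal sketch of that aggregation, with illustrative ratings not taken from the paper:

```python
# Sketch of MOS aggregation for one evaluation dimension.
# The ratings below are hypothetical examples, not the paper's data.
from statistics import mean

def mos(ratings_per_utterance):
    """Mean opinion score over all listener ratings (1-5 scale)."""
    all_ratings = [r for utt in ratings_per_utterance for r in utt]
    return mean(all_ratings)

# Illustrative ratings: 3 utterances rated by 4 listeners each.
intelligibility = [[4, 4, 5, 3], [4, 4, 4, 4], [3, 4, 5, 4]]
print(round(mos(intelligibility), 2))  # → 4.0
```

In practice each dimension (intelligibility, naturalness, consistency) is aggregated separately, which is why the abstract reports three distinct MOS values.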
Related papers
- Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
It can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x but still matching the quality of the autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z)
- Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning [3.5032870024762386]
This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech.
The approach involved finetuning a multi-speaker TTS model to work with child speech.
We conducted an objective assessment that showed a significant correlation between real and synthetic child voices.
arXiv Detail & Related papers (2023-11-07T19:31:44Z)
- Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech [34.8899247119748]
We propose an evaluation technique involving the training of an ASR model on synthetic speech and assessing its performance on real speech.
Our proposed metric demonstrates a strong correlation with both MOS naturalness and MOS intelligibility when compared to SpeechLMScore and MOSNet.
arXiv Detail & Related papers (2023-10-01T15:52:48Z)
- Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker [0.0]
We train a GAN conditioned on emotion to generate word lengths for a given input text.
These word lengths are relative to neutral speech and can be provided to a text-to-speech system to generate more expressive speech.
We achieved better performance on objective measures for neutral speech, and better time alignment for happy speech, when compared to an out-of-the-box model.
arXiv Detail & Related papers (2023-01-29T02:58:01Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations [3.0016140723286457]
We formulated non-speech vocalization (NSV) modeling as a text-to-speech task and verified its viability.
Specifically, we evaluated the phonetic expressivity of HuBERT speech units on NSVs and verified our model's ability to control speaker timbre.
arXiv Detail & Related papers (2022-06-25T14:27:10Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion score (MOS) dataset consisting solely of neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech [7.476901945542385]
We present an end-to-end text-to-speech (E2E-TTS) model with a simplified training pipeline that outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations from ESPnet2-TTS.
arXiv Detail & Related papers (2022-03-31T07:25:11Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style [111.89762723159677]
We develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech.
AdaSpeech 3 synthesizes speech with natural filled pauses (FPs) and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
arXiv Detail & Related papers (2021-07-06T10:40:45Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model predicts emotion labels from the input text alone and generates more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset via an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.