Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning
- URL: http://arxiv.org/abs/2311.04313v1
- Date: Tue, 7 Nov 2023 19:31:44 GMT
- Title: Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning
- Authors: Rishabh Jain and Peter Corcoran
- Abstract summary: This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech.
The approach involved finetuning a multi-speaker TTS model to work with child speech.
We conducted an objective assessment that showed a significant correlation between real and synthetic child voices.
- Score: 3.5032870024762386
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech synthesis technology has witnessed significant advancements in recent
years, enabling the creation of natural and expressive synthetic speech. One
area of particular interest is the generation of synthetic child speech, which
presents unique challenges due to children's distinct vocal characteristics and
developmental stages. This paper presents a novel approach that leverages the
Fastpitch text-to-speech (TTS) model for generating high-quality synthetic
child speech. This study uses a transfer learning training pipeline. The
approach involved finetuning a multi-speaker TTS model to work with child
speech. We use the cleaned version of the publicly available MyST dataset (55
hours) for our finetuning experiments. We also release a prototype dataset of
synthetic speech samples generated from this research together with model code
to support further research. By using a pretrained MOSNet, we conducted an
objective assessment that showed a significant correlation between real and
synthetic child voices. Additionally, to validate the intelligibility of the
generated speech, we employed an automatic speech recognition (ASR) model to
compare the word error rates (WER) of real and synthetic child voices. The
speaker similarity between the real and generated speech is also measured using
a pretrained speaker encoder.
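As a concrete picture of this evaluation protocol, a minimal sketch is given below; `mosnet_score`, `asr_transcribe`, and `speaker_embed` are hypothetical stand-ins for the pretrained MOSNet, the ASR model, and the speaker encoder, which the abstract does not name.

```python
# Sketch of the three objective measures described above. The helpers
# mosnet_score, asr_transcribe, and speaker_embed are hypothetical
# stand-ins for the pretrained MOSNet, the ASR model, and the speaker
# encoder, none of which are specified in the abstract.
import numpy as np
from jiwer import wer
from scipy.stats import pearsonr

def evaluate(real_wavs, synth_wavs, transcripts,
             mosnet_score, asr_transcribe, speaker_embed):
    # 1. Quality: score every utterance with MOSNet and measure how
    #    strongly real and synthetic MOS predictions correlate.
    real_mos = [mosnet_score(w) for w in real_wavs]
    synth_mos = [mosnet_score(w) for w in synth_wavs]
    mos_corr, _ = pearsonr(real_mos, synth_mos)

    # 2. Intelligibility: WER of the same ASR model on real vs. synthetic audio.
    real_wer = wer(transcripts, [asr_transcribe(w) for w in real_wavs])
    synth_wer = wer(transcripts, [asr_transcribe(w) for w in synth_wavs])

    # 3. Speaker similarity: cosine similarity of speaker-encoder embeddings.
    sims = []
    for r, s in zip(real_wavs, synth_wavs):
        er, es = np.asarray(speaker_embed(r)), np.asarray(speaker_embed(s))
        sims.append(float(er @ es / (np.linalg.norm(er) * np.linalg.norm(es))))

    return {"mos_pearson_r": float(mos_corr),
            "real_wer": real_wer,
            "synth_wer": synth_wer,
            "mean_speaker_similarity": float(np.mean(sims))}
```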
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
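A minimal sketch of the conditional real-vs-generated discrimination idea follows; the paper uses a Transformer encoder-decoder discriminator, whereas this toy MLP over mel frames is illustrative only, with invented dimensions.

```python
# Toy illustration of conditionally discriminating real vs. generated
# speech features: each mel frame is judged jointly with the text context
# it should realize. Not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameDiscriminator(nn.Module):
    def __init__(self, n_mels=80, text_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, mel, text_ctx):
        # mel: (batch, frames, n_mels); text_ctx: (batch, frames, text_dim)
        return self.net(torch.cat([mel, text_ctx], dim=-1))  # per-frame real/fake logit

def adversarial_losses(disc, real_mel, fake_mel, text_ctx):
    real_logits = disc(real_mel, text_ctx)
    fake_logits = disc(fake_mel.detach(), text_ctx)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    # The acoustic model (e.g. FastSpeech2) is pushed to make its
    # generated features indistinguishable from real ones.
    g_logits = disc(fake_mel, text_ctx)
    g_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    return d_loss, g_loss
```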
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
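The two-encoder conditioning idea might look roughly like the sketch below; the layers and dimensions are invented and do not reflect the actual TransVIP architecture.

```python
# Illustrative only: two separate conditioning paths, one for speaker
# voice characteristics and one for isochrony (speech/pause timing),
# as the summary describes. All shapes are invented.
import torch
import torch.nn as nn

class DualConditioner(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.voice_enc = nn.GRU(n_mels, dim, batch_first=True)  # "who is speaking"
        self.isochrony_enc = nn.GRU(1, dim, batch_first=True)   # "when speech/pauses occur"

    def forward(self, src_mel, src_voicing):
        # src_mel: (batch, frames, n_mels)
        # src_voicing: (batch, frames, 1) binary speech/pause track
        _, voice_h = self.voice_enc(src_mel)
        _, iso_h = self.isochrony_enc(src_voicing)
        # Concatenated conditioning vector for the translation decoder.
        return torch.cat([voice_h[-1], iso_h[-1]], dim=-1)
```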
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Toward Joint Language Modeling for Speech Units and Text [89.32163954508489]
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks.
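One simple way such mixing could work is sketched below; the paper's actual mixing techniques are more elaborate than this word-level alternation, which is shown only to make the shared-vocabulary idea concrete.

```python
# Minimal sketch of mixing discrete speech units and text into one token
# stream for a joint LM. Speech units live in a disjoint id range so a
# single vocabulary covers both modalities. Offsets are illustrative.
def mix_tokens(words, unit_spans, text_offset=0, unit_offset=50000):
    """words: list of text-token ids; unit_spans: per-word lists of
    speech-unit ids aligned to those words."""
    mixed = []
    for i, (w, units) in enumerate(zip(words, unit_spans)):
        # Alternate modalities: even positions emit the text token,
        # odd positions emit the aligned speech units instead.
        if i % 2 == 0:
            mixed.append(text_offset + w)
        else:
            mixed.extend(unit_offset + u for u in units)
    return mixed
```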
arXiv Detail & Related papers (2023-10-12T20:53:39Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics [33.070158866023]
Generative spoken language modeling (GSLM) uses learned symbols derived from data, rather than phonemes, for speech analysis and synthesis.
This paper presents findings on how effectively GSLM encodes and decodes noisy speech at both the spoken-language and speech levels.
arXiv Detail & Related papers (2023-06-01T14:07:19Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with very high similarity to a target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Text-To-Speech Data Augmentation for Low Resource Speech Recognition [0.0]
This research proposes a new data augmentation method to improve ASR models for agglutinative and low-resource languages.
Experiments were conducted using the corpus of the Quechua language, which is an agglutinative and low-resource language.
An 8.73% improvement in the word error rate (WER) of the ASR model is obtained using a combination of synthetic text and synthetic speech.
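The augmentation recipe might be assembled as in the sketch below; `tts_synthesize` is a hypothetical stand-in for the trained TTS model, and the pairing scheme is one plausible reading of the summary.

```python
# Sketch of the augmentation recipe the summary describes: extend a small
# real corpus with TTS audio generated from both real and synthetic text.
# tts_synthesize is a hypothetical stand-in for the trained TTS model.
def build_training_set(real_pairs, synthetic_texts, tts_synthesize):
    """real_pairs: list of (audio, transcript) tuples;
    synthetic_texts: extra text (e.g. from a language model) with no recordings."""
    augmented = list(real_pairs)
    # Synthetic speech for real transcripts adds acoustic variety.
    augmented += [(tts_synthesize(t), t) for _, t in real_pairs]
    # Synthetic speech for synthetic text adds lexical coverage; the
    # summary credits the combination for the reported WER improvement.
    augmented += [(tts_synthesize(t), t) for t in synthetic_texts]
    return augmented
```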
arXiv Detail & Related papers (2022-04-01T08:53:44Z)
- A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis [3.2548794659022398]
Speech synthesis has come a long way as current text-to-speech (TTS) models can now generate natural human-sounding speech.
This study developed and validated a training pipeline for fine-tuning state-of-the-art neural TTS models using child speech datasets.
arXiv Detail & Related papers (2022-03-22T09:34:21Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
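A minimal sketch of this text-only emotion conditioning is below; the dimensions and layers are invented for illustration and are not the paper's model.

```python
# Illustration of the conditioning scheme the summary describes: an
# emotion label is predicted from the input text alone (no reference
# audio), and its embedding conditions the synthesizer. Shapes invented.
import torch
import torch.nn as nn

class TextEmotionConditioner(nn.Module):
    def __init__(self, text_dim=256, n_emotions=7, emo_dim=64):
        super().__init__()
        self.classifier = nn.Linear(text_dim, n_emotions)   # emotion from text only
        self.emotion_emb = nn.Embedding(n_emotions, emo_dim)

    def forward(self, text_repr):
        # text_repr: (batch, text_dim) pooled text encoding
        logits = self.classifier(text_repr)
        emotion = logits.argmax(dim=-1)           # predicted emotion label
        return self.emotion_emb(emotion), logits  # embedding conditions the TTS decoder
```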
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement [31.33429812278942]
The proposed end-to-end speech synthesis model uses both speaker embedding and noise representation as conditional inputs to model speaker and noise information respectively.
Experimental results show that speech generated by the proposed approach achieves better subjective evaluation results than directly fine-tuning a multi-speaker speech synthesis model.
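The dual conditioning could be wired up as in the sketch below; shapes are invented, and zeroing the noise code at inference is one plausible reading of modeling speaker and noise information separately.

```python
# Sketch of the conditioning idea in the summary: the synthesizer takes a
# speaker embedding and a noise representation as separate conditional
# inputs, so noise can be "switched off" at inference. Shapes invented.
import torch
import torch.nn as nn

class SpeakerNoiseConditioner(nn.Module):
    def __init__(self, spk_dim=256, noise_dim=64, dec_dim=512):
        super().__init__()
        self.noise_dim = noise_dim
        self.proj = nn.Linear(spk_dim + noise_dim, dec_dim)

    def forward(self, spk_emb, noise_repr=None):
        # spk_emb: (batch, spk_dim); noise_repr: (batch, noise_dim) or None
        if noise_repr is None:
            # At inference, condition on an all-zero "clean" noise code so a
            # speaker seen only in noisy data can be synthesized cleanly.
            noise_repr = spk_emb.new_zeros(spk_emb.size(0), self.noise_dim)
        return self.proj(torch.cat([spk_emb, noise_repr], dim=-1))
```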
arXiv Detail & Related papers (2020-05-26T06:14:06Z)