AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data
- URL: http://arxiv.org/abs/2104.09715v1
- Date: Tue, 20 Apr 2021 01:53:30 GMT
- Title: AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data
- Authors: Yuzi Yan, Xu Tan, Bohan Li, Tao Qin, Sheng Zhao, Yuan Shen, Tie-Yan Liu
- Abstract summary: We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
- Score: 115.38309338462588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text to speech (TTS) is widely used to synthesize personal voice for a target
speaker, where a well-trained source TTS model is fine-tuned with a small amount
of paired adaptation data (speech and its transcripts) from the target speaker. However,
in many scenarios, only untranscribed speech data is available for adaptation,
which brings challenges to the previous TTS adaptation pipelines (e.g.,
AdaSpeech). In this paper, we develop AdaSpeech 2, an adaptive TTS system that
only leverages untranscribed speech data for adaptation. Specifically, we
introduce a mel-spectrogram encoder to a well-trained TTS model to conduct
speech reconstruction, and at the same time constrain the output sequence of
the mel-spectrogram encoder to be close to that of the original phoneme
encoder. In adaptation, we use untranscribed speech data for speech
reconstruction and only fine-tune the TTS decoder. AdaSpeech 2 has two
advantages: 1) Pluggable: our system can be easily applied to existing trained
TTS models without re-training. 2) Effective: our system achieves on-par voice
quality with the transcribed TTS adaptation (e.g., AdaSpeech) with the same
amount of untranscribed data, and achieves better voice quality than previous
untranscribed adaptation methods. Synthesized speech samples can be found at
https://speechresearch.github.io/adaspeech2/.
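The two-stage pipeline in the abstract (first train a mel-spectrogram encoder whose output is constrained by an L2 loss to stay close to the phoneme encoder's output, then fine-tune only the decoder on untranscribed speech reconstruction) can be sketched with toy linear stand-ins. Everything below is hypothetical and drastically simplified (linear maps instead of neural networks, a closed-form least-squares fit for the encoder constraint, plain gradient descent for decoder fine-tuning); it illustrates the shape of the method, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
MEL, HID, FRAMES = 80, 16, 200  # toy sizes: mel bins, latent dim, frames

# Stand-ins for a well-trained source TTS model (all names hypothetical):
ph = rng.normal(size=(HID, FRAMES))        # phoneme-encoder output for one utterance
W_dec = 0.1 * rng.normal(size=(MEL, HID))  # decoder weights (fine-tuned later)
mel_src = W_dec @ ph                       # source-speaker mel frames

# Step 1 (source training): fit a mel-spectrogram encoder whose output is
# constrained, via an L2 loss, to be close to the phoneme encoder's output.
# The constraint is solved here in closed form instead of by SGD.
W_mel = np.linalg.lstsq(mel_src.T, ph.T, rcond=None)[0].T
constraint_mse = np.mean((W_mel @ mel_src - ph) ** 2)

# Step 2 (adaptation): only untranscribed target-speaker speech is available,
# modeled here as a mild per-bin spectral tilt applied to the source mel.
tilt = 1.0 + 0.2 * (rng.uniform(size=MEL) - 0.5)
mel_tgt = tilt[:, None] * mel_src

# Freeze both encoders; fine-tune only the decoder on speech reconstruction.
h = W_mel @ mel_tgt                        # fixed mel-encoder output
init_mse = np.mean((W_dec @ h - mel_tgt) ** 2)
lr = 0.02
for _ in range(2000):
    recon = W_dec @ h
    W_dec -= lr * 2.0 * (recon - mel_tgt) @ h.T / FRAMES
adapted_mse = np.mean((W_dec @ h - mel_tgt) ** 2)

print(f"encoder-constraint MSE: {constraint_mse:.2e}")
print(f"reconstruction MSE before/after adaptation: {init_mse:.2e} / {adapted_mse:.2e}")
```

Because the encoders stay frozen during step 2, the adapted decoder remains "pluggable": it slots back into the source model's text-to-speech path without any retraining of the phoneme encoder.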
Related papers
- Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis [30.97784092953007]
This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition.
TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions.
This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition.
arXiv Detail & Related papers (2024-07-04T16:42:24Z)
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) synthesis.
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data [25.709370310448328]
We propose Guided-TTS 2, a diffusion-based generative model for high-quality adaptive TTS using untranscribed data.
We train the speaker-conditional diffusion model on large-scale untranscribed datasets for a classifier-free guidance method.
We demonstrate that Guided-TTS 2 shows comparable performance to high-quality single-speaker TTS baselines in terms of speech quality and speaker similarity with only ten seconds of untranscribed data.
arXiv Detail & Related papers (2022-05-30T18:30:20Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module [16.369219400819134]
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech.
When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations.
We propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker.
arXiv Detail & Related papers (2022-02-16T16:12:21Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Guided-TTS: Text-to-Speech with Untranscribed Speech [22.548875263927396]
We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data.
For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
arXiv Detail & Related papers (2021-11-23T10:05:05Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- AdaSpeech: Adaptive Text to Speech for Custom Voice [104.69219752194863]
We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices.
Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker.
arXiv Detail & Related papers (2021-03-01T13:28:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.