Guided-TTS: Text-to-Speech with Untranscribed Speech
- URL: http://arxiv.org/abs/2111.11755v1
- Date: Tue, 23 Nov 2021 10:05:05 GMT
- Title: Guided-TTS: Text-to-Speech with Untranscribed Speech
- Authors: Heeseung Kim, Sungwon Kim, Sungroh Yoon
- Abstract summary: We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data.
For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
- Score: 22.548875263927396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most neural text-to-speech (TTS) models require <speech, transcript> paired
data from the desired speaker for high-quality speech synthesis, which limits
the usage of large amounts of untranscribed data for training. In this work, we
present Guided-TTS, a high-quality TTS model that learns to generate speech
from untranscribed speech data. Guided-TTS combines an unconditional diffusion
probabilistic model with a separately trained phoneme classifier for
text-to-speech. By modeling the unconditional distribution for speech, our
model can utilize the untranscribed data for training. For text-to-speech
synthesis, we guide the generative process of the unconditional DDPM via
phoneme classification to produce mel-spectrograms from the conditional
distribution given the transcript. We show that Guided-TTS achieves
performance comparable to existing methods without any transcript for
LJSpeech. Our results further show that a single speaker-independent phoneme
classifier trained on multi-speaker large-scale data can guide unconditional
DDPMs for various speakers to perform TTS.
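The guidance step described in the abstract is classifier guidance in the Dhariwal & Nichol (2021) sense: at each reverse-diffusion step, the unconditional model's denoising estimate is shifted by the gradient of the phoneme classifier's log-likelihood of the transcript, so samples are drawn from the conditional distribution. Below is a minimal sketch of such a sampling loop; the module signatures (`eps_model`, `classifier`), the frame-aligned phoneme targets, and the guidance scale are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def guided_ddpm_sample(eps_model, classifier, phonemes, shape, betas,
                       guidance_scale=1.0, device="cpu"):
    """Ancestral DDPM sampling steered by a phoneme classifier.

    eps_model(x, t)  -> predicted noise, same shape as x (hypothetical)
    classifier(x, t) -> frame-level phoneme logits (B, frames, n_phonemes)
    phonemes         -> target phoneme index per frame, long (B, frames)
    betas            -> 1-D tensor holding the noise schedule
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)            # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
        with torch.no_grad():
            eps = eps_model(x, t_b)                  # unconditional estimate
        # Gradient of log p(phonemes | x_t): the guidance signal.
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            log_probs = F.log_softmax(classifier(x_in, t_b), dim=-1)
            target_ll = log_probs.gather(-1, phonemes.unsqueeze(-1)).sum()
            grad = torch.autograd.grad(target_ll, x_in)[0]
        # Unconditional posterior mean, then shift it toward mel frames the
        # classifier assigns to the target phonemes.
        mean = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        var = betas[t]
        mean = mean + guidance_scale * var * grad
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + var.sqrt() * noise
    return x                                         # generated mel-spectrogram
```

Raising `guidance_scale` ties the sample more tightly to the transcript, generally at some cost in diversity and, past a point, quality.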
Related papers
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones.
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
arXiv Detail & Related papers (2023-03-28T01:26:00Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data [25.709370310448328]
We propose Guided-TTS 2, a diffusion-based generative model for high-quality adaptive TTS using untranscribed data.
We train the speaker-conditional diffusion model on large-scale untranscribed datasets for a classifier-free guidance method.
We demonstrate that Guided-TTS 2 achieves performance comparable to high-quality single-speaker TTS baselines in speech quality and speaker similarity with only ten seconds of untranscribed data (a sketch of classifier-free guidance follows this list).
arXiv Detail & Related papers (2022-05-30T18:30:20Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale, text-labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
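For contrast with the Guided-TTS 2 entry above: classifier-free guidance needs no separate classifier. The diffusion model is trained with its speaker condition randomly dropped, and at sampling time the conditional and unconditional noise estimates from the same network are extrapolated. A minimal sketch, assuming a hypothetical `eps_model(x, t, cond)` signature and a learned null embedding:

```python
import torch

def cfg_eps(eps_model, x_t, t, speaker_emb, null_emb, w=1.0):
    """Blend the two noise estimates of one condition-dropout model.

    w = 0 recovers the plain conditional model; larger w pushes samples
    further from the unconditional distribution (stronger speaker
    identity, usually at some cost in diversity).
    """
    eps_cond = eps_model(x_t, t, speaker_emb)  # conditioned on target speaker
    eps_uncond = eps_model(x_t, t, null_emb)   # speaker replaced by null token
    return (1.0 + w) * eps_cond - w * eps_uncond
```

The returned estimate simply replaces the unconditional `eps` inside an ordinary DDPM sampling loop such as the one sketched earlier.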
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.