UTTS: Unsupervised TTS with Conditional Disentangled Sequential
Variational Auto-encoder
- URL: http://arxiv.org/abs/2206.02512v2
- Date: Tue, 7 Jun 2022 01:30:17 GMT
- Title: UTTS: Unsupervised TTS with Conditional Disentangled Sequential
Variational Auto-encoder
- Authors: Jiachen Lian and Chunlei Zhang and Gopala Krishna Anumanchipalli and
Dong Yu
- Abstract summary: We propose a novel unsupervised text-to-speech (UTTS) framework which does not require text-audio pairs for the TTS acoustic modeling (AM).
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
- Score: 30.376259456529368
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a novel unsupervised text-to-speech (UTTS)
framework which does not require text-audio pairs for the TTS acoustic modeling
(AM). UTTS is a multi-speaker speech synthesizer developed from the perspective
of disentangled speech representation learning. The framework offers a flexible
choice of a speaker's duration model, timbre feature (identity) and content for
TTS inference. We leverage recent advancements in self-supervised speech
representation learning as well as speech synthesis front-end techniques for
the system development. Specifically, we utilize a lexicon to map input text to
the phoneme sequence, which is expanded to the frame-level forced alignment
(FA) with a speaker-dependent duration model. Then, we develop an alignment
mapping module that converts the FA to the unsupervised alignment (UA).
Finally, a Conditional Disentangled Sequential Variational Auto-encoder
(C-DSVAE), serving as the self-supervised TTS AM, takes the predicted UA and a
target speaker embedding to generate the mel spectrogram, which is ultimately
converted to waveform with a neural vocoder. We show how our method enables
speech synthesis without using a paired TTS corpus. Experiments demonstrate
that UTTS can synthesize speech of high naturalness and intelligibility
measured by human and objective evaluations.
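To make the inference flow described in the abstract concrete, below is a minimal Python sketch of the pipeline: lexicon look-up, speaker-dependent duration expansion to a forced alignment (FA), mapping from FA to the unsupervised alignment (UA), C-DSVAE acoustic modeling, and neural vocoding. The module names and interfaces (lexicon, duration_model, alignment_mapper, cdsvae_decoder, vocoder) are hypothetical placeholders assumed for illustration, not the authors' released implementation; only the data flow follows the abstract.

```python
# Hypothetical sketch of the UTTS inference pipeline; module interfaces are
# illustrative placeholders, not the authors' released code.
import torch


def utts_inference(text, lexicon, duration_model, alignment_mapper,
                   cdsvae_decoder, vocoder, speaker_embedding):
    # 1. Front-end: the lexicon maps the input text to a phoneme ID sequence.
    phonemes = lexicon.text_to_phonemes(text)                  # LongTensor [num_phonemes]

    # 2. Speaker-dependent duration model predicts frames per phoneme.
    durations = duration_model(phonemes, speaker_embedding)    # LongTensor [num_phonemes]

    # 3. Expand phonemes to the frame-level forced alignment (FA).
    forced_alignment = torch.repeat_interleave(phonemes, durations)   # [num_frames]

    # 4. Alignment mapping module converts the FA to the unsupervised alignment (UA).
    unsupervised_alignment = alignment_mapper(forced_alignment)       # [num_frames, ua_dim]

    # 5. C-DSVAE acoustic model generates the mel spectrogram conditioned on
    #    the UA and the target speaker embedding.
    mel = cdsvae_decoder(unsupervised_alignment, speaker_embedding)   # [num_frames, n_mels]

    # 6. Neural vocoder converts the mel spectrogram to a waveform.
    return vocoder(mel)
```

In this sketch, only the C-DSVAE acoustic model and the alignment mapper are specific to UTTS; the lexicon, duration model, and vocoder are conventional front-end and back-end components, which is what allows the flexible choice of duration model, timbre, and content mentioned above.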
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and
Pause-based Prosody Modeling [25.966328901566815]
We propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling.
Experimental results show PauseSpeech outperforms previous models in terms of naturalness.
arXiv Detail & Related papers (2023-06-13T01:36:55Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent text-to-speech architecture is adapted for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z) - Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - Into-TTS : Intonation Template based Prosody Control System [17.68906373821669]
Intonation plays an important role in delivering the intention of the speaker.
Current end-to-end TTS systems often fail to model proper intonations.
We propose a novel, intuitive method to synthesize speech in different intonations.
arXiv Detail & Related papers (2022-04-04T06:37:19Z) - JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to
Speech [7.476901945542385]
We present an end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPnet2-TTS.
arXiv Detail & Related papers (2022-03-31T07:25:11Z) - NAUTILUS: a Versatile Voice Cloning System [44.700803634034486]
NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.
It can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm.
It achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
arXiv Detail & Related papers (2020-05-22T05:00:20Z)