End-to-End Text-to-Speech using Latent Duration based on VQ-VAE
- URL: http://arxiv.org/abs/2010.09602v2
- Date: Tue, 20 Oct 2020 13:46:35 GMT
- Title: End-to-End Text-to-Speech using Latent Duration based on VQ-VAE
- Authors: Yusuke Yasuda, Xin Wang, Junichi Yamagishi
- Abstract summary: Explicit duration modeling is a key to achieving robust and efficient alignment in text-to-speech synthesis (TTS).
We propose a new TTS framework using explicit duration modeling that incorporates duration as a discrete latent variable into TTS.
- Score: 48.151894340550385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explicit duration modeling is a key to achieving robust and efficient
alignment in text-to-speech synthesis (TTS). We propose a new TTS framework
using explicit duration modeling that incorporates duration as a discrete
latent variable into TTS and enables joint optimization of all modules from
scratch. We formulate our method based on conditional VQ-VAE to handle discrete
duration in a variational autoencoder and provide a theoretical explanation to
justify our method. In our framework, a connectionist temporal classification
(CTC)-based forced aligner acts as the approximate posterior, and
text-to-duration works as the prior in the variational autoencoder. We
evaluated our proposed method with a listening test and compared it with other
TTS methods based on soft-attention or explicit duration modeling. The results
showed that our systems were rated between soft-attention-based methods
(Transformer-TTS, Tacotron2) and explicit duration modeling-based methods
(FastSpeech).
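To make the formulation above concrete, the abstract's description reads as a conditional variational objective of the form log p(y | x) >= E_{q(d | x, y)}[log p(y | x, d)] - KL(q(d | x, y) || p(d | x)), where x is the input text, y the target speech, d the discrete per-token duration latent, q the CTC-aligner-based approximate posterior, and p(d | x) the text-to-duration prior. The following sketch is not the authors' implementation: it is a minimal PyTorch illustration of a discrete duration latent with a posterior/prior pair and a duration-driven decoder. The class name LatentDurationTTS, the pooled GRU posterior (standing in for the CTC-based forced aligner and the VQ-VAE machinery), the argmax duration choice, and all dimensions are assumptions made for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_DUR = 32  # durations treated as a discrete variable in {0, ..., MAX_DUR-1}

class LatentDurationTTS(nn.Module):
    """Toy TTS with a discrete latent duration (hypothetical names and sizes)."""

    def __init__(self, vocab_size=64, d_model=128, n_mels=80):
        super().__init__()
        self.text_enc = nn.Embedding(vocab_size, d_model)
        # Stand-in for the CTC-based forced aligner: an approximate
        # posterior q(d | x, y) that sees both text and target speech.
        self.speech_enc = nn.GRU(n_mels, d_model, batch_first=True)
        self.posterior = nn.Linear(2 * d_model, MAX_DUR)
        # Text-to-duration prior p(d | x), used alone at synthesis time.
        self.prior = nn.Linear(d_model, MAX_DUR)
        # Decoder p(y | x, d): duration-expanded text states -> mel frames.
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, text, mel):
        h = self.text_enc(text)                          # (B, T_text, d)
        s, _ = self.speech_enc(mel)                      # (B, T_mel, d)
        s_pool = s.mean(dim=1, keepdim=True).expand_as(h)
        q_logits = self.posterior(torch.cat([h, s_pool], dim=-1))
        p_logits = self.prior(h)
        # Hard per-token duration from the posterior (the paper's VQ-VAE /
        # CTC machinery is replaced here by a plain argmax).
        dur = q_logits.argmax(dim=-1)                    # (B, T_text) integers
        # KL(q || p) over the discrete duration variable.
        kl = F.kl_div(F.log_softmax(p_logits, dim=-1),
                      F.softmax(q_logits, dim=-1),
                      reduction="batchmean")
        frames = self._expand(h, dur, mel.size(1))       # (B, T_mel, d)
        out, _ = self.decoder(frames)
        recon = F.l1_loss(self.to_mel(out), mel)
        return recon + kl

    @staticmethod
    def _expand(h, dur, t_mel):
        # Repeat each text state by its duration, then pad/trim to T_mel.
        rows = [torch.repeat_interleave(h[b], dur[b].clamp(min=1), dim=0)[:t_mel]
                for b in range(h.size(0))]
        frames = nn.utils.rnn.pad_sequence(rows, batch_first=True)
        if frames.size(1) < t_mel:
            frames = F.pad(frames, (0, 0, 0, t_mel - frames.size(1)))
        return frames

# Smoke test with random data (shapes only; no claim about quality).
if __name__ == "__main__":
    model = LatentDurationTTS()
    text = torch.randint(0, 64, (2, 11))   # batch of 2, 11 tokens each
    mel = torch.randn(2, 95, 80)           # 95 mel frames, 80 bins
    loss = model(text, mel)
    loss.backward()
    print(float(loss))

In the actual framework the posterior comes from CTC forced alignment and the discrete latent is handled with VQ-VAE machinery, both of which this sketch replaces with the simplest possible stand-ins.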
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer [9.032701216955497]
We present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders.
Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations.
We scale the training dataset and the model size to 82K hours and 790M parameters, respectively.
arXiv Detail & Related papers (2024-06-17T11:25:57Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis.
We propose a differentiable duration method for learning monotonic sequences between input and output.
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration (see the sketch after this list).
arXiv Detail & Related papers (2022-03-21T15:14:44Z)
- With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition [95.99542238790038]
We propose a method that learns to attend to surrounding actions in order to improve recognition performance.
To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities.
We test our approach on EPIC-KITCHENS and EGTEA datasets reporting state-of-the-art performance.
arXiv Detail & Related papers (2021-11-01T15:27:35Z)
- Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network [41.4599368523939]
We propose an incremental TTS method that directly predicts the unobserved future context with a lightweight model.
Experimental results show that the proposed method requires about ten times less inference time to achieve comparable synthetic speech quality.
arXiv Detail & Related papers (2021-09-22T13:29:10Z)
- STYLER: Style Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech [2.622482339911829]
STYLER is a novel expressive text-to-speech model with parallelized architecture.
Our novel noise modeling approach from audio using domain adversarial training and Residual Decoding enabled style transfer without transferring noise.
arXiv Detail & Related papers (2021-03-17T07:11:09Z)
- Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)
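The "Differentiable Duration Modeling" entry above mentions two concrete ingredients: expanding encoder states by predicted durations and a loss that matches the predicted total duration to the ground-truth number of frames. The fragment below is only a hedged illustration of those two ideas, not that paper's method (which makes the expansion itself differentiable); the DurationRegulator name and all dimensions are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DurationRegulator(nn.Module):
    """Toy length regulator with a total-duration matching loss."""

    def __init__(self, d_model=128):
        super().__init__()
        self.dur_predictor = nn.Linear(d_model, 1)   # per-token log-duration

    def forward(self, enc, n_frames_target):
        # enc: (B, T_text, d); n_frames_target: (B,) ground-truth frame counts
        dur = torch.exp(self.dur_predictor(enc).squeeze(-1))   # positive durations
        # Total-duration matching: keep the expanded length near the target.
        total_loss = F.l1_loss(dur.sum(dim=-1), n_frames_target.float())
        # Hard expansion for synthesis; the rounding here is the
        # non-differentiable step that the cited paper works around.
        rows = [torch.repeat_interleave(enc[b],
                                        dur[b].round().long().clamp(min=1), dim=0)
                for b in range(enc.size(0))]
        frames = nn.utils.rnn.pad_sequence(rows, batch_first=True)
        return frames, total_loss

# Example usage on random tensors.
if __name__ == "__main__":
    reg = DurationRegulator()
    enc = torch.randn(2, 7, 128)
    frames, loss = reg(enc, torch.tensor([60, 45]))
    print(frames.shape, float(loss))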
This list is automatically generated from the titles and abstracts of the papers in this site.