A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS
- URL: http://arxiv.org/abs/2209.10887v1
- Date: Thu, 22 Sep 2022 09:43:17 GMT
- Title: A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS
- Authors: Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng
- Abstract summary: We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis.
A vector-quantized variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data.
In synthesis, the neural vocoder converts the predicted MSMCRs into final speech waveforms.
- Score: 52.51848317549301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance
neural TTS synthesis. A vector-quantized variational autoencoder (VQ-VAE)
based feature analyzer is used to encode Mel spectrograms of speech training
data by down-sampling progressively in multiple stages into MSMC
Representations (MSMCRs) with different time resolutions, and quantizing them
with multiple VQ codebooks, respectively. Multi-stage predictors are trained to
map the input text sequence to MSMCRs progressively by minimizing a combined
loss of the reconstruction Mean Square Error (MSE) and "triplet loss". In
synthesis, the neural vocoder converts the predicted MSMCRs into final speech
waveforms. The proposed approach is trained and tested with an English TTS
database of 16 hours by a female speaker. The proposed TTS achieves an MOS
score of 4.41, which outperforms the baseline with an MOS of 3.62. Compact
versions of the proposed TTS with many fewer parameters can still preserve high
MOS scores. Ablation studies show that both multiple stages and multiple
codebooks are effective for achieving high TTS performance.
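To make the approach described in the abstract concrete, the sketch below illustrates its two core ingredients in code: quantizing an encoded Mel-spectrogram sequence with multiple codebooks at several progressively down-sampled stages, and training the text-to-MSMCR predictor with a combined reconstruction MSE and triplet loss. This is a minimal illustration assuming PyTorch, not the authors' implementation; the module names, the average-pooling down-sampler, the codebook counts and sizes, and the margin/weight values are all illustrative, and cross-stage conditioning as well as the VQ-VAE codebook/commitment losses are omitted.
```python
# Minimal sketch (PyTorch assumed); names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiCodebookVQ(nn.Module):
    """Quantize each frame with several codebooks (one per sub-vector) and concatenate."""

    def __init__(self, dim, num_codebooks=4, codebook_size=64):
        super().__init__()
        assert dim % num_codebooks == 0
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim // num_codebooks) for _ in range(num_codebooks)]
        )

    def forward(self, z):                                  # z: (batch, frames, dim)
        outputs = []
        for chunk, book in zip(z.chunk(len(self.codebooks), dim=-1), self.codebooks):
            dists = (chunk.unsqueeze(-2) - book.weight).norm(dim=-1)   # (B, T, K)
            q = book(dists.argmin(dim=-1))                 # nearest codeword per frame
            outputs.append(chunk + (q - chunk).detach())   # straight-through estimator
        return torch.cat(outputs, dim=-1)                  # codebook/commitment losses omitted


class MultiStageQuantizer(nn.Module):
    """Down-sample the encoded sequence progressively and quantize every stage."""

    def __init__(self, dim, num_stages=2, **vq_kwargs):
        super().__init__()
        self.stages = nn.ModuleList(
            [MultiCodebookVQ(dim, **vq_kwargs) for _ in range(num_stages)]
        )

    def forward(self, z):                                  # z: (batch, frames, dim)
        msmcr = []
        for s, vq in enumerate(self.stages):
            # stage s operates at a 2**s-times coarser time resolution
            z_s = F.avg_pool1d(z.transpose(1, 2), kernel_size=2 ** s).transpose(1, 2)
            msmcr.append(vq(z_s))
        return msmcr                                       # one quantized sequence per stage


def predictor_loss(pred, target_codes, codebook, margin=1.0, alpha=0.1):
    """Reconstruction MSE plus a triplet-style term that pulls the prediction toward its
    target codeword and away from the closest competing codeword."""
    target_vec = codebook[target_codes]                    # (B, T, D)
    mse = F.mse_loss(pred, target_vec)
    dists = (pred.unsqueeze(-2) - codebook).norm(dim=-1)   # (B, T, K)
    d_pos = dists.gather(-1, target_codes.unsqueeze(-1)).squeeze(-1)
    d_neg = dists.scatter(-1, target_codes.unsqueeze(-1), float("inf")).min(dim=-1).values
    return mse + alpha * F.relu(d_pos - d_neg + margin).mean()
```
In the full system, the predicted MSMCRs from all stages would then be passed to a neural vocoder to generate the waveform; the encoder, decoder, and vocoder are left out of this sketch.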
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z)
- QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning [65.35080911787882]
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower supervised data requirements.
Two VQ-S3R learners provide effective speech representations and pre-trained models for TTS.
The results powerfully demonstrate the superior performance of QS-TTS, winning the highest MOS over supervised or semi-supervised baseline TTS approaches.
arXiv Detail & Related papers (2023-08-31T20:25:44Z)
- Towards Robust FastSpeech 2 by Modelling Residual Multimodality [4.4904382374090765]
State-of-the-art non-autoregressive text-to-speech models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech.
We observe characteristic audio distortions in expressive speech datasets and model the residual multimodality with a trivariate-chain Gaussian mixture model (TVC-GMM).
TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality, in particular for expressive datasets.
arXiv Detail & Related papers (2023-06-02T11:03:26Z)
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require large amounts of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST, with an average accuracy improvement of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
- Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations [43.31594896204752]
This paper aims to enhance low-resource TTS by reducing training data requirements using compact speech representations.
A Multi-Stage Multi-Codebook (MSMC) VQ-GAN is trained to learn the representation, MSMCR, and decode it to waveforms.
We optimize the training strategy by leveraging more audio to learn MSMCRs better for low-resource languages.
arXiv Detail & Related papers (2022-10-27T02:32:00Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition [4.753402561130792]
We present a simple and efficient modification by combining the outputs of multiple FLSTM stacks with different views.
We show that the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% for different speaker and acoustic environment scenarios.
arXiv Detail & Related papers (2020-06-30T22:19:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences arising from its use.