A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS
- URL: http://arxiv.org/abs/2209.10887v1
- Date: Thu, 22 Sep 2022 09:43:17 GMT
- Title: A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS
- Authors: Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng
- Abstract summary: We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis.
A vector-quantized variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data.
In synthesis, the neural vocoder converts the predicted MSMCRs into final speech waveforms.
- Score: 52.51848317549301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance
neural TTS synthesis. A vector-quantized variational autoencoder (VQ-VAE)
based feature analyzer is used to encode Mel spectrograms of speech training
data by down-sampling progressively in multiple stages into MSMC
Representations (MSMCRs) with different time resolutions, and quantizing them
with multiple VQ codebooks, respectively. Multi-stage predictors are trained to
map the input text sequence to MSMCRs progressively by minimizing a combined
loss of the reconstruction Mean Square Error (MSE) and "triplet loss". In
synthesis, the neural vocoder converts the predicted MSMCRs into final speech
waveforms. The proposed approach is trained and tested with an English TTS
database of 16 hours by a female speaker. The proposed TTS achieves an MOS
score of 4.41, which outperforms the baseline with an MOS of 3.62. Compact
versions of the proposed TTS with many fewer parameters can still preserve high
MOS scores. Ablation studies show that both multiple stages and multiple
codebooks are effective for achieving high TTS performance.
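To make the approach described in the abstract concrete, the sketch below illustrates its two core ingredients in code: quantizing an encoded Mel-spectrogram sequence with multiple codebooks at several progressively down-sampled stages, and training the text-to-MSMCR predictor with a combined reconstruction MSE and triplet loss. This is a minimal illustration assuming PyTorch, not the authors' implementation; the module names, the average-pooling down-sampler, the codebook counts and sizes, and the margin/weight values are all illustrative, and cross-stage conditioning as well as the VQ-VAE codebook/commitment losses are omitted.
```python
# Minimal sketch (PyTorch assumed); names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiCodebookVQ(nn.Module):
    """Quantize each frame with several codebooks (one per sub-vector) and concatenate."""

    def __init__(self, dim, num_codebooks=4, codebook_size=64):
        super().__init__()
        assert dim % num_codebooks == 0
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim // num_codebooks) for _ in range(num_codebooks)]
        )

    def forward(self, z):                                  # z: (batch, frames, dim)
        outputs = []
        for chunk, book in zip(z.chunk(len(self.codebooks), dim=-1), self.codebooks):
            dists = (chunk.unsqueeze(-2) - book.weight).norm(dim=-1)   # (B, T, K)
            q = book(dists.argmin(dim=-1))                 # nearest codeword per frame
            outputs.append(chunk + (q - chunk).detach())   # straight-through estimator
        return torch.cat(outputs, dim=-1)                  # codebook/commitment losses omitted


class MultiStageQuantizer(nn.Module):
    """Down-sample the encoded sequence progressively and quantize every stage."""

    def __init__(self, dim, num_stages=2, **vq_kwargs):
        super().__init__()
        self.stages = nn.ModuleList(
            [MultiCodebookVQ(dim, **vq_kwargs) for _ in range(num_stages)]
        )

    def forward(self, z):                                  # z: (batch, frames, dim)
        msmcr = []
        for s, vq in enumerate(self.stages):
            # stage s operates at a 2**s-times coarser time resolution
            z_s = F.avg_pool1d(z.transpose(1, 2), kernel_size=2 ** s).transpose(1, 2)
            msmcr.append(vq(z_s))
        return msmcr                                       # one quantized sequence per stage


def predictor_loss(pred, target_codes, codebook, margin=1.0, alpha=0.1):
    """Reconstruction MSE plus a triplet-style term that pulls the prediction toward its
    target codeword and away from the closest competing codeword."""
    target_vec = codebook[target_codes]                    # (B, T, D)
    mse = F.mse_loss(pred, target_vec)
    dists = (pred.unsqueeze(-2) - codebook).norm(dim=-1)   # (B, T, K)
    d_pos = dists.gather(-1, target_codes.unsqueeze(-1)).squeeze(-1)
    d_neg = dists.scatter(-1, target_codes.unsqueeze(-1), float("inf")).min(dim=-1).values
    return mse + alpha * F.relu(d_pos - d_neg + margin).mean()
```
In the full system, the predicted MSMCRs from all stages would then be passed to a neural vocoder to generate the waveform; the encoder, decoder, and vocoder are left out of this sketch.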
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z)
- QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning [65.35080911787882]
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower supervised data requirements.
Two VQ-S3R learners provide effective speech representations and pre-trained models for TTS.
The results powerfully demonstrate the superior performance of QS-TTS, winning the highest MOS over supervised or semi-supervised baseline TTS approaches.
arXiv Detail & Related papers (2023-08-31T20:25:44Z)
- Towards Robust FastSpeech 2 by Modelling Residual Multimodality [4.4904382374090765]
State-of-the-art non-autoregressive text-to-speech models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech.
We observe characteristic audio distortions in expressive speech datasets and model the residual multimodality with a trivariate-chain Gaussian mixture model (TVC-GMM).
TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality, in particular for expressive datasets.
arXiv Detail & Related papers (2023-06-02T11:03:26Z)
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require large amounts of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST, with an average accuracy improvement of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
- Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations [43.31594896204752]
This paper aims to enhance low-resource TTS by reducing training data requirements using compact speech representations.
A Multi-Stage Multi-Codebook (MSMC) VQ-GAN is trained to learn the representation, MSMCR, and decode it to waveforms.
We optimize the training strategy by leveraging more audio to learn MSMCRs better for low-resource languages.
arXiv Detail & Related papers (2022-10-27T02:32:00Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition [4.753402561130792]
We present a simple and efficient modification by combining the outputs of multiple FLSTM stacks with different views.
We show that the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% for different speaker and acoustic environment scenarios.
arXiv Detail & Related papers (2020-06-30T22:19:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences arising from its use.