VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech
with Adversarial Learning and Architecture Design
- URL: http://arxiv.org/abs/2307.16430v1
- Date: Mon, 31 Jul 2023 06:36:44 GMT
- Title: VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech
with Adversarial Learning and Architecture Design
- Authors: Jungil Kong, Jihoon Park, Beomjeong Kim, Jeongmin Kim, Dohee Kong,
Sangjin Kim
- Abstract summary: We introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes more natural speech.
We propose improved structures and training mechanisms and show that they are effective in improving naturalness.
We demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method.
- Score: 7.005639198341213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Single-stage text-to-speech models have been actively studied recently, and
their results have outperformed two-stage pipeline systems. Although the
previous single-stage model has made great progress, there is room for
improvement in terms of its intermittent unnaturalness, computational
efficiency, and strong dependence on phoneme conversion. In this work, we
introduce VITS2, a single-stage text-to-speech model that efficiently
synthesizes more natural speech by improving several aspects of the previous
work. We propose improved structures and training mechanisms and show that
the proposed methods are effective in improving naturalness, similarity of
speech characteristics in a multi-speaker model, and efficiency of training and
inference. Furthermore, we demonstrate that the strong dependence on phoneme
conversion in previous works can be significantly reduced with our method,
which allows a fully end-to-end single-stage approach.
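To make the two-stage vs. single-stage distinction concrete, below is a minimal, purely illustrative Python sketch. All model functions are hypothetical stand-ins, not the VITS2 implementation; the point is that a two-stage pipeline chains a phonemizer, an acoustic model, and a vocoder, while a fully end-to-end single-stage model maps raw text straight to a waveform, removing the hard dependence on phoneme conversion.

```python
# Illustrative sketch only: contrasts a two-stage TTS pipeline with the
# single-stage, end-to-end setup the abstract describes. Every model
# function here is a hypothetical stand-in so the sketch runs as-is.
from typing import List

def grapheme_to_phoneme(text: str) -> List[str]:
    # Hypothetical G2P front end; real systems use tools such as g2p_en
    # or espeak-ng. Errors made here are what "strong dependence on
    # phoneme conversion" refers to.
    return list(text.lower().replace(" ", "|"))

def two_stage_tts(text: str) -> List[float]:
    phonemes = grapheme_to_phoneme(text)   # front end: text -> phonemes
    mel = acoustic_model(phonemes)         # stage 1: phonemes -> mel-spectrogram
    return vocoder(mel)                    # stage 2: mel -> waveform

def single_stage_tts(text: str) -> List[float]:
    # A fully end-to-end model maps raw (or lightly normalized) text
    # straight to a waveform, so a separate phonemizer and vocoder are
    # no longer hard requirements.
    return end_to_end_model(text)

# Dummy stand-ins so the sketch executes.
def acoustic_model(phonemes): return [float(len(p)) for p in phonemes]
def vocoder(mel): return [x * 0.1 for x in mel]
def end_to_end_model(text): return [0.1] * len(text)

if __name__ == "__main__":
    print(two_stage_tts("hello world")[:5])
    print(single_stage_tts("hello world")[:5])
```

Removing the G2P front end matters because phonemizer errors propagate to every downstream stage; a single-stage model trained on raw text sidesteps that failure mode, which is the abstract's motivation for the fully end-to-end approach.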
Related papers
- Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation [6.813336394564509]
We introduce a semantic knowledge distillation method that enables high-quality speech generation in a single stage.
Our proposed model improves speech quality, intelligibility, and speaker similarity compared to a single-stage baseline.
arXiv Detail & Related papers (2024-09-17T09:08:43Z)
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- Pre-training Feature Guided Diffusion Model for Speech Enhancement [37.88469730135598]
Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments.
We introduce a novel pretraining feature-guided diffusion model tailored for efficient speech enhancement.
arXiv Detail & Related papers (2024-06-11T18:22:59Z)
- Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition [12.77573161345651]
This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR.
The proposed model enables the optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling.
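The integration pattern this summary describes is commonly realized by projecting speech-encoder features into the LLM's embedding space so the whole stack can be trained end to end. The sketch below is an assumption-laden illustration of that pattern; module names, dimensions, and the projection design are ours, not the cited paper's code.

```python
# Minimal sketch: a pre-trained speech encoder feeds a projection layer
# whose outputs are consumed by an LLM as if they were text embeddings.
import torch
import torch.nn as nn

class SpeechToLLMBridge(nn.Module):
    def __init__(self, speech_dim: int = 512, llm_dim: int = 2048):
        super().__init__()
        # Projects speech-encoder features into the LLM's embedding space.
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, speech_dim) from a frozen or
        # fine-tuned speech representation model. The projected result can
        # be prepended to text-token embeddings, letting gradients flow
        # through acoustic feature extraction and language modeling alike.
        return self.proj(speech_feats)

bridge = SpeechToLLMBridge()
feats = torch.randn(2, 100, 512)   # dummy acoustic features
llm_inputs = bridge(feats)         # shape: (2, 100, 2048)
print(llm_inputs.shape)
```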
arXiv Detail & Related papers (2023-12-06T18:34:42Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- BatGPT: A Bidirectional Autoregessive Talker from Generative Pre-trained Transformer [77.28871523946418]
BatGPT is a large-scale language model designed and trained jointly by Wuhan University and Shanghai Jiao Tong University.
It is capable of generating highly natural and fluent text in response to various types of input, including text prompts, images, and audio.
arXiv Detail & Related papers (2023-07-01T15:10:01Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech significantly improves inference latency, enabling speedups of up to 21.4x over the autoregressive technique.
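The mask-and-predict decoding mentioned above can be illustrated with a toy loop: all target unit positions start masked, the model fills them in parallel, and the least-confident predictions are re-masked for a few refinement iterations. The predictor below is a random-scoring stand-in, not TranSpeech itself.

```python
# Toy sketch of mask-and-predict non-autoregressive decoding.
import random

MASK = -1

def predict(units, vocab_size=10):
    # Stand-in for the real unit predictor: returns a (unit, confidence)
    # guess for every position, in parallel.
    return [(random.randrange(vocab_size), random.random()) for _ in units]

def mask_predict(length=8, iterations=3):
    units = [MASK] * length
    for it in range(iterations):
        guesses = predict(units)
        units = [u for u, _ in guesses]
        # Re-mask the least confident positions; fewer each iteration,
        # so the final pass leaves every position filled.
        n_mask = length * (iterations - 1 - it) // iterations
        worst = sorted(range(length), key=lambda i: guesses[i][1])[:n_mask]
        for i in worst:
            units[i] = MASK
    return units

print(mask_predict())
```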
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- An Investigation of End-to-End Models for Robust Speech Recognition [20.998349142078805]
We present a comparison of speech enhancement-based techniques and three different model-based adaptation techniques for robust automatic speech recognition.
While adversarial learning is the best-performing technique on certain noise types, it comes at the cost of degraded WER on clean speech.
On other, relatively stationary noise types, a new speech enhancement technique outperforms all the model-based adaptation techniques.
arXiv Detail & Related papers (2021-02-11T19:47:13Z)
- Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition [41.92991390542083]
We present a simple, novel and competitive approach for phoneme-based neural transducer modeling.
A phonetic context size of one is shown to be sufficient for the best performance.
The overall performance of our best model is comparable to state-of-the-art (SOTA) results for the TED-LIUM Release 2 and Switchboard corpora.
arXiv Detail & Related papers (2020-10-30T16:53:29Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.