Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
- URL: http://arxiv.org/abs/2301.02111v1
- Date: Thu, 5 Jan 2023 15:37:15 GMT
- Title: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
- Authors: Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie
Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu
Wei
- Abstract summary: We introduce a language modeling approach for text to speech synthesis (TTS)
Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model.
Vall-E exhibits emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech.
- Score: 92.55131711064935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a language modeling approach for text to speech synthesis (TTS).
Specifically, we train a neural codec language model (called Vall-E) using
discrete codes derived from an off-the-shelf neural audio codec model, and
regard TTS as a conditional language modeling task rather than continuous
signal regression as in previous work. During the pre-training stage, we scale
up the TTS training data to 60K hours of English speech, which is hundreds of
times larger than existing systems. Vall-E exhibits emergent in-context learning
capabilities and can be used to synthesize high-quality personalized speech
with only a 3-second enrolled recording of an unseen speaker as an acoustic
prompt. Experimental results show that Vall-E significantly outperforms the
state-of-the-art zero-shot TTS system in terms of speech naturalness and
speaker similarity. In addition, we find that Vall-E can preserve the speaker's
emotion and the acoustic environment of the acoustic prompt in synthesis. See
https://aka.ms/valle for demos of our work.
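The core formulation above, TTS as conditional language modeling over discrete codec tokens, can be pictured with a short sketch. Everything below (the class name, layer sizes, the single-stage design, the random inputs) is an illustrative assumption, not the paper's implementation: phoneme tokens and acoustic-prompt codec tokens are concatenated into one sequence, and a causal Transformer predicts the next codec token.

```python
import torch
import torch.nn as nn

class CodecLM(nn.Module):
    """Minimal sketch of a neural codec language model in the spirit of
    Vall-E: a causal Transformer over discrete codec tokens, conditioned
    on phonemes and an acoustic prompt. All sizes are arbitrary."""

    def __init__(self, n_phonemes=100, n_codes=1024, d_model=512):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, d_model)
        self.code_emb = nn.Embedding(n_codes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, n_codes)

    def forward(self, phonemes, prompt_codes, target_codes):
        # Text, acoustic-prompt, and target audio tokens form one sequence;
        # the causal mask turns next-token prediction into conditional LM.
        x = torch.cat([
            self.phone_emb(phonemes),      # (B, T_text, D)
            self.code_emb(prompt_codes),   # (B, T_prompt, D)
            self.code_emb(target_codes),   # (B, T_target, D)
        ], dim=1)
        seq_len = x.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")),
                            diagonal=1)
        h = self.backbone(x, mask=causal)
        return self.head(h)  # logits over the codec vocabulary

# Usage: score the next codec token given text and a ~3-second prompt.
model = CodecLM()
phonemes = torch.randint(0, 100, (1, 20))        # phonemized text
prompt_codes = torch.randint(0, 1024, (1, 225))  # enrolled-recording codes
target_codes = torch.randint(0, 1024, (1, 300))  # codes generated so far
logits = model(phonemes, prompt_codes, target_codes)
print(logits.shape)  # torch.Size([1, 545, 1024])
```

In the paper itself, an autoregressive model predicts only the first codec quantizer's tokens and a non-autoregressive model fills in the remaining quantizer layers; this sketch collapses both stages into one for brevity, and at inference the predicted codes would be decoded back to a waveform by the codec.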
Related papers
- Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model [11.62674351793]
We introduce a novel audio codec-based TTS model to adapt context features with multiple enhancements.
Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer.
Our proposed method outperforms baselines across various context TTS scenarios.
arXiv Detail & Related papers (2024-06-06T03:06:45Z)
- Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
It can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x but still matching the quality of the autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors (see the RVQ sketch after this list).
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus [3.1925030748447747]
We present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic.
The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated.
The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40.1 kHz.
arXiv Detail & Related papers (2023-02-28T20:18:59Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voices, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning [0.802904964931021]
The proposed approach aims to overcome these limitations by building a system able to model a multi-speaker acoustic space.
This allows the generation of speech audio similar to the voice of different target speakers, even if they were not observed during the training phase.
arXiv Detail & Related papers (2021-02-10T18:43:56Z)
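The residual vector quantization referenced in the NaturalSpeech 2 entry above (and used by the neural codecs behind models like Vall-E) is easy to show in isolation. The toy sketch below uses random codebooks and arbitrary sizes purely for illustration, not any paper's trained configuration: each stage quantizes the residual left by the previous stage, so the stack of per-stage indices is the discrete code sequence and the summed codewords are the reconstruction.

```python
import torch

def residual_vector_quantize(x, codebooks):
    """Toy residual vector quantization (RVQ): each codebook quantizes the
    residual left over by the previous stage. Codebooks here are random
    tensors purely for illustration, not a trained codec."""
    residual = x
    codes, quantized = [], torch.zeros_like(x)
    for cb in codebooks:                      # cb: (codebook_size, dim)
        dists = torch.cdist(residual, cb)     # (n, codebook_size)
        idx = dists.argmin(dim=-1)            # nearest codeword per frame
        chosen = cb[idx]
        quantized = quantized + chosen
        residual = residual - chosen
        codes.append(idx)
    return torch.stack(codes), quantized      # (stages, n) indices, recon

# Example: 3 stages of 1024-entry codebooks over 128-dim latent frames.
torch.manual_seed(0)
codebooks = [torch.randn(1024, 128) for _ in range(3)]
latents = torch.randn(16, 128)
codes, recon = residual_vector_quantize(latents, codebooks)
print(codes.shape, (latents - recon).pow(2).mean())
```

Each latent frame ends up represented by one index per stage; sequences of exactly such indices are what codec language models treat as audio tokens.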
This list is automatically generated from the titles and abstracts of the papers on this site.