ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided
Sequence Reordering
- URL: http://arxiv.org/abs/2401.07333v1
- Date: Sun, 14 Jan 2024 17:43:55 GMT
- Title: ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided
Sequence Reordering
- Authors: Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Xie Chen
- Abstract summary: ELLA-V is a text-to-speech framework that enables fine-grained control over synthesized audio at the phoneme level.
Our model outperforms VALL-E in accuracy and delivers more stable results under both greedy and sampling-based decoding strategies.
- Score: 9.646664943647208
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The language model (LM) approach based on acoustic and linguistic prompts,
such as VALL-E, has achieved remarkable progress in the field of zero-shot
audio generation. However, existing methods still have some limitations: 1)
repetitions, transpositions, and omissions in the synthesized speech due
to limited alignment constraints between audio and phoneme tokens; 2)
challenges in fine-grained control over the synthesized speech with an
autoregressive (AR) language model; 3) infinite silence generation due to the
nature of AR-based decoding, especially under the greedy strategy. To alleviate
these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot
text-to-speech (TTS) framework, which enables fine-grained control over
synthesized audio at the phoneme level. The key to ELLA-V is interleaving
sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of
the corresponding acoustic tokens. Experiments show that our model
outperforms VALL-E in accuracy and delivers more stable results under both
greedy and sampling-based decoding strategies. The code for ELLA-V will be
open-sourced after cleanup. Audio samples are available at
https://ereboas.github.io/ELLAV/.
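To make the key idea concrete, here is a minimal sketch of how such an interleaved training sequence could be assembled once a forced aligner has mapped each phoneme to its span of acoustic codec tokens. The Segment and interleave names are illustrative assumptions, not identifiers from the paper's code, and any special markers the paper may add (e.g., end-of-phoneme tokens) are omitted.

```python
# A minimal sketch (not the authors' implementation): build one training
# sequence in which each phoneme token appears immediately ahead of its
# alignment-derived acoustic (codec) tokens.
from dataclasses import dataclass

@dataclass
class Segment:
    phoneme: str          # phoneme token, e.g. "AH0" (hypothetical example)
    acoustic: list[int]   # codec token ids aligned to this phoneme

def interleave(segments: list[Segment]) -> list[str | int]:
    sequence: list[str | int] = []
    for seg in segments:
        sequence.append(seg.phoneme)    # phoneme first ...
        sequence.extend(seg.acoustic)   # ... then its aligned acoustic tokens
    return sequence

# Two phonemes with alignment-derived codec spans (made-up token ids).
segments = [Segment("HH", [101, 102]), Segment("AY1", [240, 241, 242])]
print(interleave(segments))
# ['HH', 101, 102, 'AY1', 240, 241, 242]
```

Because every acoustic token is preceded by the phoneme it realizes, the model sees an explicit local alignment at every step, which is what enables phoneme-level control and discourages skips and repeats.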
Related papers
- Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis [7.2129341612013285]
We introduce Lina-Speech, a model that replaces traditional self-attention mechanisms with emerging recurrent architectures such as Gated Linear Attention (GLA).
This approach is fast, easy to deploy, and achieves performance comparable to fine-tuned baselines when the dataset size ranges from 3 to 15 minutes.
arXiv Detail & Related papers (2024-10-30T04:50:40Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder (a minimal sketch of this idea appears after this list).
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot text-to-speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers [119.89284877061779]
This paper introduces VALL-E 2, the latest advancement in neural codec language models, marking a milestone in zero-shot text-to-speech (TTS).
VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases.
The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis.
arXiv Detail & Related papers (2024-06-08T06:31:03Z)
- LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT [65.69648099999439]
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks.
We propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation.
arXiv Detail & Related papers (2023-10-07T03:17:59Z)
- Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM [19.36630667212398]
We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation.
Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis (a toy version of such an objective appears after this list).
Our method surpasses existing spoken language models in speaker preservation and semantic coherence.
arXiv Detail & Related papers (2023-05-24T15:39:43Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
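As referenced in the CosyVoice entry above, here is a minimal sketch of the supervised-semantic-token idea under stated assumptions: a vector-quantization layer dropped into an ASR encoder snaps hidden states to a codebook, yielding discrete token ids. All class names, shapes, and sizes are illustrative, not the actual CosyVoice implementation.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbor quantizer with a straight-through gradient."""
    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, h: torch.Tensor):
        # h: (batch, time, dim) hidden states from the encoder's lower layers.
        codes = self.codebook.weight.unsqueeze(0).expand(h.size(0), -1, -1)
        dists = torch.cdist(h, codes)      # (batch, time, codebook_size)
        tokens = dists.argmin(dim=-1)      # (batch, time) semantic token ids
        quantized = self.codebook(tokens)  # (batch, time, dim)
        # Straight-through estimator: values come from the codebook,
        # gradients flow back into the encoder through h.
        quantized = h + (quantized - h).detach()
        return quantized, tokens

# Toy usage: 2 utterances, 50 frames, 256-dim states, 512-entry codebook.
vq = VectorQuantizer(codebook_size=512, dim=256)
quantized, tokens = vq(torch.randn(2, 50, 256))
print(tokens.shape)  # torch.Size([2, 50])
```

Because the quantizer sits inside a recognizer trained on transcripts, the resulting tokens are supervised to carry linguistic content, which is the property the downstream text-to-token LLM relies on.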
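Similarly, the jointly supervised objective mentioned in the Spectron entry can be illustrated as a toy weighted sum of per-task losses. The function name, weights, and shapes below are assumptions for illustration, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def joint_loss(asr_logits, asr_targets, lm_logits, lm_targets,
               spec_pred, spec_target, w_asr=1.0, w_lm=1.0, w_spec=1.0):
    # Token-level cross-entropy for the recognized transcript.
    l_asr = F.cross_entropy(asr_logits.transpose(1, 2), asr_targets)
    # Token-level cross-entropy for the text continuation.
    l_lm = F.cross_entropy(lm_logits.transpose(1, 2), lm_targets)
    # Frame-level regression for the synthesized spectrogram.
    l_spec = F.l1_loss(spec_pred, spec_target)
    return w_asr * l_asr + w_lm * l_lm + w_spec * l_spec

# Toy shapes: batch 2, vocab 1000, 20 transcript / 30 continuation tokens,
# 100 spectrogram frames with 80 mel bins.
loss = joint_loss(
    torch.randn(2, 20, 1000), torch.randint(0, 1000, (2, 20)),
    torch.randn(2, 30, 1000), torch.randint(0, 1000, (2, 30)),
    torch.randn(2, 100, 80), torch.randn(2, 100, 80),
)
print(loss.item())
```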
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.