Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
- URL: http://arxiv.org/abs/2306.15687v2
- Date: Thu, 19 Oct 2023 13:23:28 GMT
- Title: Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
- Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel
Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning
Hsu
- Abstract summary: Voicebox is the most versatile text-guided generative model for speech at scale.
It can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation.
It outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (a 1.9% word error rate vs VALL-E's 5.9%) and audio similarity (0.681 vs 0.580) while being up to 20 times faster.
- Score: 58.46845567087977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the
research community. These models not only generate high fidelity outputs, but
are also generalists which can solve tasks not explicitly taught. In contrast,
speech generative models are still primitive in terms of scale and task
generalization. In this paper, we present Voicebox, the most versatile
text-guided generative model for speech at scale. Voicebox is a
non-autoregressive flow-matching model trained to infill speech given audio
context and text, using over 50K hours of speech that is neither filtered nor
enhanced. Similar to GPT, Voicebox can perform many different tasks through
in-context learning, but is more flexible as it can also condition on future
context. Voicebox can be used for mono or cross-lingual zero-shot
text-to-speech synthesis, noise removal, content editing, style conversion, and
diverse sample generation. In particular, Voicebox outperforms the
state-of-the-art zero-shot TTS model VALL-E on both intelligibility (a 1.9%
word error rate vs VALL-E's 5.9%) and audio similarity (0.681 vs VALL-E's
0.580) while being up to
20 times faster. Audio samples can be found at https://voicebox.metademolab.com.
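As a concrete illustration of the flow-matching infilling objective the abstract describes, here is a minimal PyTorch sketch of an optimal-transport conditional flow-matching loss with masked infilling. The model signature, (B, T, D) feature shapes, masking scheme, and sigma_min default are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of an optimal-transport conditional flow-matching loss for
# masked speech infilling (in the style of Lipman et al.). The model
# signature, mel feature shapes, mask, and sigma_min are assumptions.
import torch

def flow_matching_infill_loss(model, mel, text_emb, mask, sigma_min=1e-4):
    """mel: (B, T, D) target speech features; text_emb: (B, T, D) aligned
    text conditioning; mask: (B, T) bool, True where frames are infilled."""
    b = mel.size(0)
    t = torch.rand(b, 1, 1, device=mel.device)      # flow time ~ U[0, 1]
    x0 = torch.randn_like(mel)                      # source noise sample
    # Optimal-transport conditional path from noise x0 to data x1 = mel.
    xt = (1 - (1 - sigma_min) * t) * x0 + t * mel
    target = mel - (1 - sigma_min) * x0             # target velocity field
    # Audio context: the model only sees the unmasked frames.
    ctx = mel * (~mask).unsqueeze(-1)
    pred = model(xt, ctx, text_emb, t.view(b))      # hypothetical signature
    # Regress the velocity on the infilled frames only.
    m = mask.unsqueeze(-1).float()
    return ((pred - target) ** 2 * m).sum() / m.sum().clamp(min=1.0)
```

At inference, the learned velocity field is integrated from noise to data with an ODE solver; the small, fixed number of solver steps is what makes such non-autoregressive models fast relative to autoregressive decoding.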
Related papers
- Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model [3.462371782084948]
We show that the SpeechT5 model can generate a synthetic voice for any speaker using only one minute of the target speaker's data.
We demonstrate the high quality and speaker similarity of the synthetic voices on publicly known Czech politicians and celebrities.
arXiv Detail & Related papers (2024-07-24T11:14:06Z)
- FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs [63.8261207950923]
FunAudioLLM is a model family designed to enhance natural voice interactions between humans and large language models (LLMs).
At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity.
The models related to SenseVoice and CosyVoice have been open-sourced on ModelScope and Hugging Face, with the corresponding training, inference, and fine-tuning code released on GitHub.
arXiv Detail & Related papers (2024-07-04T16:49:02Z)
- Audiobox: Unified Audio Generation with Natural Language Prompts [37.39834044113061]
This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities.
We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms.
Audiobox sets new benchmarks on speech and sound generation and unlocks new methods for generating audio with novel vocal and acoustic styles.
arXiv Detail & Related papers (2023-12-25T22:24:49Z)
- TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation [54.155138561698514]
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning.
Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.
We propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages.
arXiv Detail & Related papers (2023-12-23T08:45:57Z)
- VoiceLDM: Text-to-Speech with Environmental Context [22.29992463094861]
VoiceLDM is a model designed to produce audio that accurately follows two distinct natural language text prompts.
By utilizing pretrained contrastive language-audio pretraining (CLAP) and Whisper, VoiceLDM is trained on large amounts of real-world audio without manual annotations or transcriptions.
We show that VoiceLDM is capable of generating plausible audio that aligns well with both input conditions, even surpassing the speech intelligibility of the ground truth audio on the AudioCaps test set.
arXiv Detail & Related papers (2023-09-24T15:20:59Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain the quantized latent vectors (a minimal RVQ sketch follows this list).
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
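The residual vector quantization (RVQ) referenced in the NaturalSpeech 2 entry above can be summarized in a few lines: each quantizer stage picks the nearest codeword for the residual left by the previous stage, so the sum of the selected codewords approximates the latent vector. The codebook count, sizes, and random initialization below are illustrative assumptions, not the system's trained codec.

```python
# Minimal sketch of residual vector quantization (RVQ): each stage
# quantizes the residual left by the previous stage, so the sum of the
# selected codewords approximates the input. The codebook count, size,
# and random initialization are illustrative assumptions.
import torch

def rvq_encode(x, codebooks):
    """x: (N, D) latent vectors; codebooks: list of (K, D) tensors."""
    residual = x
    quantized = torch.zeros_like(x)
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (N, K) pairwise distances
        idx = dists.argmin(dim=1)           # nearest codeword per vector
        q = cb[idx]
        quantized = quantized + q
        residual = residual - q
        indices.append(idx)
    return quantized, indices

# Example: 4 stages of 1024 codewords over 256-dim latents.
codebooks = [torch.randn(1024, 256) for _ in range(4)]
quantized, codes = rvq_encode(torch.randn(8, 256), codebooks)
```

A decoder reverses this by summing the codewords indexed at each stage; in a trained codec the codebooks are learned jointly with the encoder and decoder rather than sampled at random.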