Audiobox: Unified Audio Generation with Natural Language Prompts
- URL: http://arxiv.org/abs/2312.15821v1
- Date: Mon, 25 Dec 2023 22:24:49 GMT
- Title: Audiobox: Unified Audio Generation with Natural Language Prompts
- Authors: Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu,
Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff
Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz,
Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood,
Joshua Lane, Mary Williamson, Wei-Ning Hsu
- Abstract summary: This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities.
We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms.
Audiobox sets new benchmarks on speech and sound generation and unlocks new methods for generating audio with novel vocal and acoustic styles.
- Score: 37.39834044113061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio is an essential part of our lives, but creating it often requires
expertise and is time-consuming. Research communities have made great progress
over the past year advancing the performance of large-scale audio generative
models for a single modality (speech, sound, or music) by adopting more
powerful generative models and scaling data. However, these models lack
controllability in several aspects: speech generation models cannot synthesize
novel styles based on a text description and are limited in domain coverage
(e.g., outdoor environments); sound generation models only provide coarse-grained
control based on descriptions like "a person speaking" and would only generate
mumbling human voices. This paper presents Audiobox, a unified model based on
flow-matching that is capable of generating various audio modalities. We design
description-based and example-based prompting to enhance controllability and
unify speech and sound generation paradigms. We allow transcript, vocal, and
other audio styles to be controlled independently when generating speech. To
improve model generalization with limited labels, we adapt a self-supervised
infilling objective to pre-train on large quantities of unlabeled audio.
Audiobox sets new benchmarks on speech and sound generation (0.745 similarity
on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and
unlocks new methods for generating audio with novel vocal and acoustic styles.
We further integrate Bespoke Solvers, which speeds up generation by over 25
times compared to the default ODE solver for flow-matching, without loss of
performance on several tasks. Our demo is available at
https://audiobox.metademolab.com/
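The abstract names flow-matching as the generative backbone. For orientation, here is a minimal sketch of an optimal-transport conditional flow-matching training step (in the style of Lipman et al.); the toy network, feature dimensions, and conditioning vector are hypothetical stand-ins, not Audiobox's actual architecture or conditioning scheme.

```python
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Toy stand-in for the network that predicts the flow's velocity."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, t, cond):
        # Time t enters as one extra input feature.
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond, sigma_min: float = 1e-4):
    """Optimal-transport conditional flow matching: interpolate
    noise x0 toward data x1 and regress the constant path velocity."""
    x0 = torch.randn_like(x1)                      # noise sample at t = 0
    t = torch.rand(x1.shape[0], 1)                 # uniform time in [0, 1]
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1  # point on the probability path
    target = x1 - (1 - sigma_min) * x0             # velocity the model should match
    return ((model(x_t, t, cond) - target) ** 2).mean()

# Hypothetical usage with made-up feature/conditioning sizes:
model = VectorField(dim=80, cond_dim=16)
loss = flow_matching_loss(model, torch.randn(4, 80), torch.randn(4, 16))
```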
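Sampling then integrates the learned velocity field from noise (t = 0) to data (t = 1) with an ODE solver, and each solver step costs one model evaluation. The sketch below uses plain fixed-step Euler; the reported 25x speedup from Bespoke Solvers comes from learning a solver that matches the default solver's quality in far fewer steps. The step count here is an arbitrary illustration.

```python
import torch

@torch.no_grad()
def euler_sample(model, cond, dim: int, steps: int = 32):
    """Generate by Euler-integrating dx/dt = model(x, t, cond)."""
    x = torch.randn(cond.shape[0], dim)   # start from noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0], 1), i * dt)
        x = x + dt * model(x, t, cond)    # one Euler step = one model call
    return x                              # approximate data sample at t = 1
```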
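On the metrics: the 0.77 reported on AudioCaps is a Fréchet Audio Distance (lower is better), i.e. the Fréchet distance between Gaussians fitted to embeddings of reference and generated audio. A sketch of the standard computation, assuming the embeddings (VGGish in the original FAD formulation) have already been extracted:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real: np.ndarray, fake: np.ndarray) -> float:
    """FAD between two embedding sets of shape (num_clips, embed_dim)."""
    mu_r, mu_f = real.mean(axis=0), fake.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_f = np.cov(fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):          # sqrtm can pick up tiny
        covmean = covmean.real            # imaginary numerical noise
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```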
Related papers
- Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
We propose a novel video-and-text-to-sound generation method called ReWaS.
Our method estimates the structural information of audio from the video while receiving key content cues from a user prompt.
By separating the generative components of audio, the method becomes a more flexible system that lets users freely adjust the energy, surrounding environment, and primary sound source to their preferences.
arXiv Detail & Related papers (2024-07-08T01:59:17Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generating audio from text prompts is an important part of creative workflows in the music and film industry.
The work hypothesizes that focusing on certain aspects of the generated audio, namely the presence of concepts or events and their temporal ordering relative to the prompt, can improve audio generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
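For context on the preference step, a minimal sketch of the DPO objective over such (winner, loser) pairs follows; the log-likelihood inputs are stand-ins, and the diffusion-specific likelihood surrogate Tango 2 actually optimizes is abstracted away.

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """-log sigmoid(beta * ((policy margin) - (reference margin)))."""
    margin = (logp_w - logp_l) - (ref_logp_w - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```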
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
This paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation.
Our framework introduces a general representation of audio, called "language of audio" (LOA).
arXiv Detail & Related papers (2023-08-10T17:55:13Z)
- WavJourney: Compositional Audio Creation with Large Language Models [38.39551216587242]
We present WavJourney, a novel framework that leverages Large Language Models to connect various audio models for audio creation.
WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions.
We show that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions.
arXiv Detail & Related papers (2023-07-26T17:54:04Z)
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale [58.46845567087977]
Voicebox is the most versatile text-guided generative model for speech at scale.
It can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation.
It outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs. 1.9% word error rate for VALL-E vs. Voicebox) and audio similarity (0.580 vs. 0.681), while being up to 20 times faster.
arXiv Detail & Related papers (2023-06-23T16:23:24Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has achieved milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)