AudioGen: Textually Guided Audio Generation
- URL: http://arxiv.org/abs/2209.15352v1
- Date: Fri, 30 Sep 2022 10:17:05 GMT
- Title: AudioGen: Textually Guided Audio Generation
- Authors: Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi
- Abstract summary: We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
- Score: 116.57006301417306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle the problem of generating audio samples conditioned on descriptive
text captions. In this work, we propose AudioGen, an auto-regressive
generative model that generates audio samples conditioned on text inputs.
AudioGen operates on a learnt discrete audio representation. The task of
text-to-audio generation poses multiple challenges. Due to the way audio
travels through a medium, differentiating "objects" can be a difficult task
(e.g., separating multiple people simultaneously speaking). This is further
complicated by real-world recording conditions (e.g., background noise,
reverberation, etc.). Scarce text annotations impose another constraint,
limiting the ability to scale models. Finally, modeling high-fidelity audio
requires encoding audio at high sampling rate, leading to extremely long
sequences. To alleviate the aforementioned challenges we propose an
augmentation technique that mixes different audio samples, driving the model to
internally learn to separate multiple sources. We curated 10 datasets
containing different types of audio and text annotations to handle the scarcity
of text-audio data points. For faster inference, we explore the use of
multi-stream modeling, allowing the use of shorter sequences while maintaining
a similar bitrate and perceptual quality. We apply classifier-free guidance to
improve adherence to text. Compared to the evaluated baselines, AudioGen
performs better on both objective and subjective metrics. Finally, we explore the
ability of the proposed method to generate audio continuation conditionally and
unconditionally. Samples: https://tinyurl.com/audiogen-text2audio
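As a concrete illustration of the mixing augmentation described above, the following Python sketch combines two clips at a random signal-to-noise ratio. The function name, SNR range, and peak normalization are illustrative assumptions rather than the paper's exact recipe; in the paper, the text captions of the mixed clips are also merged so the model sees a description of both sources.

```python
import numpy as np

def mix_audio(wav_a: np.ndarray, wav_b: np.ndarray,
              snr_db_range=(-5.0, 5.0), rng=None) -> np.ndarray:
    """Mix two mono waveforms (same sampling rate) at a random SNR (in dB)."""
    if rng is None:
        rng = np.random.default_rng()
    n = max(len(wav_a), len(wav_b))
    # Tile/truncate both clips to a common length.
    a = np.resize(wav_a, n)
    b = np.resize(wav_b, n)

    # Scale the second source to hit a randomly drawn SNR.
    snr_db = rng.uniform(*snr_db_range)
    power_a = np.mean(a**2) + 1e-9
    power_b = np.mean(b**2) + 1e-9
    scale = np.sqrt(power_a / (power_b * 10.0**(snr_db / 10.0)))

    mixed = a + scale * b
    # Peak-normalize only if the mixture would clip.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```

Training on such mixtures forces the model to implicitly account for multiple overlapping sources, which is the separation ability the abstract motivates.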
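The classifier-free guidance step can be sketched as follows: the model is run twice per decoding step, once with the text condition and once with a null (empty-prompt) condition, and the two predictions are extrapolated. The `model` call signature, `null_cond`, and the default `guidance_scale` here are assumptions for illustration, not the paper's interface.

```python
import torch

@torch.no_grad()
def guided_logits(model, audio_tokens, text_cond, null_cond,
                  guidance_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance for one auto-regressive decoding step.

    `model(audio_tokens, cond)` is assumed to return next-token logits of
    shape (batch, vocab); `null_cond` encodes an empty text prompt.
    """
    cond_logits = model(audio_tokens, text_cond)    # text-conditioned pass
    uncond_logits = model(audio_tokens, null_cond)  # unconditional pass
    # Extrapolate away from the unconditional prediction toward the
    # conditional one; scale > 1 strengthens adherence to the text.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```

Larger `guidance_scale` values generally trade sample diversity for stronger adherence to the text prompt.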
Related papers
- Improving Text-To-Audio Models with Synthetic Captions [51.19111942748637]
We propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale.
We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions.
arXiv Detail & Related papers (2024-06-18T00:02:15Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important part of content-creation processes in the music and film industry.
We hypothesize that focusing on these aspects of audio generation can improve performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- Audiobox: Unified Audio Generation with Natural Language Prompts [37.39834044113061]
This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities.
We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms.
Audiobox sets new benchmarks on speech and sound generation and unlocks new methods for generating audio with novel vocal and acoustic styles.
arXiv Detail & Related papers (2023-12-25T22:24:49Z)
- Retrieval-Augmented Text-to-Audio Generation [36.328134891428085]
We show that state-of-the-art models, such as AudioLDM, are biased in their generation performance.
We propose a simple retrieval-augmented approach for TTA models.
We show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types.
arXiv Detail & Related papers (2023-09-14T22:35:39Z)
- WavJourney: Compositional Audio Creation with Large Language Models [38.39551216587242]
We present WavJourney, a novel framework that leverages Large Language Models to connect various audio models for audio creation.
WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions.
We show that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions.
arXiv Detail & Related papers (2023-07-26T17:54:04Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)