Retrieval-Augmented Text-to-Audio Generation
- URL: http://arxiv.org/abs/2309.08051v2
- Date: Fri, 5 Jan 2024 14:10:51 GMT
- Title: Retrieval-Augmented Text-to-Audio Generation
- Authors: Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
- Abstract summary: We show that the state-of-the-art models, such as AudioLDM, are biased in their generation performance.
We propose a simple retrieval-augmented approach for TTA models.
We show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent progress in text-to-audio (TTA) generation, we show that the
state-of-the-art models, such as AudioLDM, trained on datasets with an
imbalanced class distribution, such as AudioCaps, are biased in their
generation performance. Specifically, they excel in generating common audio
classes while underperforming in the rare ones, thus degrading the overall
generation performance. We refer to this problem as long-tailed text-to-audio
generation. To address this issue, we propose a simple retrieval-augmented
approach for TTA models. Specifically, given an input text prompt, we first
leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve
relevant text-audio pairs. The features of the retrieved audio-text data are
then used as additional conditions to guide the learning of TTA models. We
enhance AudioLDM with our proposed approach and denote the resulting augmented
system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a
state-of-the-art Fréchet Audio Distance (FAD) of 1.37, outperforming the
existing approaches by a large margin. Furthermore, we show that Re-AudioLDM
can generate realistic audio for complex scenes, rare audio classes, and even
unseen audio types, indicating its potential in TTA tasks.
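The retrieval step described in the abstract is straightforward to prototype. Below is a minimal sketch assuming precomputed CLAP embeddings for a datastore of text-audio pairs; the function and array names are illustrative, not from the Re-AudioLDM codebase.

```python
# Minimal sketch of CLAP-based retrieval, assuming precomputed embeddings;
# names are illustrative, not from the Re-AudioLDM codebase.
import numpy as np

def retrieve_pairs(prompt_emb, caption_embs, audio_feats, k=3):
    """Return the k text-audio pairs whose captions best match the prompt.

    prompt_emb:   (D,)     CLAP embedding of the input text prompt
    caption_embs: (N, D)   CLAP embeddings of the datastore captions
    audio_feats:  (N, ...) features of the paired audio clips
    """
    # Cosine similarity between the prompt and every stored caption.
    sims = caption_embs @ prompt_emb / (
        np.linalg.norm(caption_embs, axis=1) * np.linalg.norm(prompt_emb) + 1e-8)
    top = np.argsort(-sims)[:k]  # indices of the k most similar captions
    return caption_embs[top], audio_feats[top]

# Demo with random stand-in embeddings.
caps, auds = np.random.randn(100, 512), np.random.randn(100, 64, 8)
text_feats, audio_feats = retrieve_pairs(np.random.randn(512), caps, auds)
```

Per the abstract, the retrieved features are then used as additional conditions for the TTA model alongside the prompt embedding; the sketch only covers the nearest-neighbour lookup.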
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z) - AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations [1.2101820447447276]
Multi-modal learning in the audio-language domain has seen significant advancements in recent years.
However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks.
Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations.
This scalable method produces AudioSetMix, a high-quality training dataset for text-audio models.
arXiv Detail & Related papers (2024-05-17T21:08:58Z) - Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important part of the production process in the music and film industry.
We hypothesize that focusing on these aspects of audio generation could improve performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z) - AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
- AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
This paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation.
Our framework introduces a general representation of audio, called the "language of audio" (LOA).
arXiv Detail & Related papers (2023-08-10T17:55:13Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained
Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion
Models [65.18102159618631]
Multimodal generative modeling has achieved milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z) - AudioLDM: Text-to-Audio Generation with Latent Diffusion Models [35.703877904270726]
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio from text descriptions.
In this study, we propose AudioLDM, a TTA system built on a latent space that learns continuous audio representations from contrastive language-audio pretraining (CLAP) latents.
Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics.
arXiv Detail & Related papers (2023-01-29T17:48:17Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.