Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
- URL: http://arxiv.org/abs/2301.12661v1
- Date: Mon, 30 Jan 2023 04:44:34 GMT
- Title: Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
- Authors: Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao
- Abstract summary: Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
- Score: 65.18102159618631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale multimodal generative modeling has created milestones in
text-to-image and text-to-video generation. Its application to audio still lags
behind for two main reasons: the lack of large-scale datasets with high-quality
text-audio pairs, and the complexity of modeling long continuous audio data. In
this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that
addresses these gaps by 1) introducing pseudo prompt enhancement with a
distill-then-reprogram approach, which alleviates data scarcity with
orders-of-magnitude more concept compositions by using language-free audio;
and 2) leveraging a spectrogram autoencoder to predict the self-supervised
audio representation instead of waveforms. Together with robust contrastive
language-audio
pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art
results in both objective and subjective benchmark evaluation. Moreover, we
present its controllability and generalization for X-to-Audio with "No Modality
Left Behind", for the first time unlocking the ability to generate
high-definition, high-fidelity audio given a user-defined modality input.
Audio samples are available at https://Text-to-Audio.github.io
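To make the two components concrete, here is a minimal sketch of a conditioned latent-diffusion training step in the spirit of the abstract: a frozen text embedding (standing in for CLAP) conditions a denoiser that operates on spectrogram-autoencoder latents rather than raw waveforms. The module names, sizes, and cosine noise schedule are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Predicts the noise added to a spectrogram latent, given text features."""
    def __init__(self, latent_dim=64, text_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim + 1, 128),
            nn.SiLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, z_t, t, text_emb):
        # Concatenate noisy latent, text condition, and timestep.
        h = torch.cat([z_t, text_emb, t[:, None]], dim=-1)
        return self.net(h)

denoiser = TinyDenoiser()
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

# Stand-ins for spectrogram-autoencoder latents and CLAP text embeddings.
z0 = torch.randn(8, 64)        # clean latents for a batch of audio clips
text_emb = torch.randn(8, 32)  # frozen text-encoder output for the captions

# One DDPM-style training step: add noise at a random timestep, predict it.
t = torch.rand(8)                              # continuous timestep in [0, 1]
alpha = torch.cos(t * torch.pi / 2)[:, None]   # simple cosine schedule
sigma = torch.sin(t * torch.pi / 2)[:, None]
noise = torch.randn_like(z0)
z_t = alpha * z0 + sigma * noise

opt.zero_grad()
loss = F.mse_loss(denoiser(z_t, t, text_emb), noise)
loss.backward()
opt.step()
print(f"denoising loss: {loss.item():.4f}")
```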
Related papers
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of content-creation processes in the music and film industry.
We hypothesize that focusing on the alignment between prompts and generated audio can improve generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
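A minimal sketch of the preference objective above, in the spirit of Diffusion-DPO: the policy is pushed to denoise the "winner" audio better than the "loser", relative to a frozen reference model. The tensors and the beta value are illustrative placeholders, not Tango 2's exact setup.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(err_w, err_l, ref_err_w, ref_err_l, beta=2000.0):
    """err_* are per-example denoising MSEs for winner/loser audio latents;
    ref_* come from a frozen copy of the pre-trained model."""
    # Positive margin = policy improved on the winner more than on the loser.
    margin = (ref_err_w - err_w) - (ref_err_l - err_l)
    return -F.logsigmoid(beta * margin).mean()

# Dummy denoising errors for a batch of 4 preference pairs.
err_w = torch.rand(4, requires_grad=True)
err_l = torch.rand(4, requires_grad=True)
ref_err_w, ref_err_l = torch.rand(4), torch.rand(4)

loss = diffusion_dpo_loss(err_w, err_l, ref_err_w, ref_err_l)
loss.backward()
print(f"preference loss: {loss.item():.4f}")
```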
- Audiobox: Unified Audio Generation with Natural Language Prompts [37.39834044113061]
This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities.
We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms.
Audiobox sets new benchmarks on speech and sound generation and unlocks new methods for generating audio with novel vocal and acoustic styles.
arXiv Detail & Related papers (2023-12-25T22:24:49Z)
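A minimal sketch of the flow-matching objective Audiobox builds on: learn a velocity field that transports noise to data along straight interpolation paths. The tiny MLP and random "data" are placeholders, not the model's real architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

field = nn.Sequential(nn.Linear(16 + 1, 64), nn.SiLU(), nn.Linear(64, 16))

x1 = torch.randn(32, 16)          # "data": audio features for a batch
x0 = torch.randn(32, 16)          # noise samples
t = torch.rand(32, 1)             # random interpolation times in [0, 1]

xt = (1 - t) * x0 + t * x1        # point on the straight path at time t
target = x1 - x0                  # constant velocity along that path

pred = field(torch.cat([xt, t], dim=-1))
loss = F.mse_loss(pred, target)   # regress the velocity field
loss.backward()
print(f"flow-matching loss: {loss.item():.4f}")
```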
- Retrieval-Augmented Text-to-Audio Generation [36.328134891428085]
We show that the state-of-the-art models, such as AudioLDM, are biased in their generation performance.
We propose a simple retrieval-augmented approach for TTA models.
We show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types.
arXiv Detail & Related papers (2023-09-14T22:35:39Z)
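A minimal sketch of the retrieval idea above: embed the input caption, fetch the top-k most similar captions from a text-audio datastore by cosine similarity, and hand their paired audio features to the generator as extra conditioning. The embeddings here are random stand-ins for CLAP outputs.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb, db_text_embs, db_audio_feats, k=3):
    """Return the audio features of the k captions nearest to the query."""
    sims = F.cosine_similarity(query_emb[None, :], db_text_embs, dim=-1)
    top = sims.topk(k).indices
    return db_audio_feats[top]

db_text_embs = torch.randn(1000, 512)    # datastore caption embeddings
db_audio_feats = torch.randn(1000, 128)  # paired audio features

query_emb = torch.randn(512)             # embedding of the user's prompt
extra_cond = retrieve_top_k(query_emb, db_text_embs, db_audio_feats)
print(extra_cond.shape)  # torch.Size([3, 128]) -> appended to the condition
```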
- AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
This paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation.
Our framework introduces a general representation of audio, called "language of audio" (LOA).
arXiv Detail & Related papers (2023-08-10T17:55:13Z)
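A rough interface sketch of the two-stage idea: an LOA sequence acts as the shared target for speech, music, and sound effects, so any modality that can be mapped to LOA reuses the same generator. Both modules below are hypothetical placeholders, not AudioLDM 2's actual networks.

```python
import torch
import torch.nn as nn

class CondToLOA(nn.Module):
    """Stage 1: translate a conditioning embedding into an LOA sequence."""
    def __init__(self, cond_dim=32, loa_dim=16, loa_len=8):
        super().__init__()
        self.proj = nn.Linear(cond_dim, loa_dim * loa_len)
        self.loa_dim, self.loa_len = loa_dim, loa_len

    def forward(self, cond):
        return self.proj(cond).view(-1, self.loa_len, self.loa_dim)

class LOAToAudio(nn.Module):
    """Stage 2: decode audio features from the LOA sequence; the same
    decoder serves speech, music, and sound-effect generation."""
    def __init__(self, loa_dim=16, audio_dim=64):
        super().__init__()
        self.dec = nn.Linear(loa_dim, audio_dim)

    def forward(self, loa):
        return self.dec(loa)

cond = torch.randn(4, 32)              # text / image / audio conditioning
audio = LOAToAudio()(CondToLOA()(cond))
print(audio.shape)                     # torch.Size([4, 8, 64])
```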
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper, we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
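A minimal sketch of the transfer step described above: pre-train an audio decoder on audio-only data, then copy its weights into the video-to-speech model before fine-tuning. The module shapes and file name are illustrative only.

```python
import torch
import torch.nn as nn

audio_decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 256))
# ... imagine this decoder was trained on thousands of hours of raw audio ...
torch.save(audio_decoder.state_dict(), "pretrained_decoder.pt")

class VideoToSpeech(nn.Module):
    def __init__(self):
        super().__init__()
        self.video_encoder = nn.Linear(512, 32)   # silent-video features -> latent
        self.decoder = nn.Sequential(             # same shape as the pre-trained one
            nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 256)
        )

model = VideoToSpeech()
# Initialize only the decoder from the audio-only pre-training run.
model.decoder.load_state_dict(torch.load("pretrained_decoder.pt"))
speech = model.decoder(model.video_encoder(torch.randn(2, 512)))
print(speech.shape)  # torch.Size([2, 256])
```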
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
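A generic sketch of cross-modal fusion in the spirit of the local-global mechanisms mentioned above: video tokens attend to audio tokens so acoustic cues flow into the caption features. This is a hedged illustration, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

d = 64
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

video_tokens = torch.randn(2, 20, d)   # per-frame visual features
audio_tokens = torch.randn(2, 50, d)   # per-segment audio features

# Each video token queries the audio stream; the result is fused additively.
attended, _ = cross_attn(query=video_tokens, key=audio_tokens, value=audio_tokens)
fused = video_tokens + attended        # audio-aware features for the captioner
print(fused.shape)                     # torch.Size([2, 20, 64])
```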
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
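A toy sketch of the auto-regressive formulation above: audio is represented as a sequence of discrete codes (as produced by a learned audio codec), and a model predicts each next code conditioned on the text and the codes so far. The transformer below is a stand-in (causal masking omitted for brevity), not AudioGen's actual network.

```python
import torch
import torch.nn as nn

vocab, d = 256, 64
embed = nn.Embedding(vocab, d)
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
head = nn.Linear(d, vocab)

text_cond = torch.randn(1, 1, d)          # pooled text embedding as a prefix
tokens = torch.randint(0, vocab, (1, 1))  # start token

with torch.no_grad():
    for _ in range(16):                   # generate 16 audio codes
        h = torch.cat([text_cond, embed(tokens)], dim=1)
        logits = head(layer(h))[:, -1]    # distribution over the next code
        nxt = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)

print(tokens.shape)  # torch.Size([1, 17]) -> decoded back to audio by a codec
```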
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency.
Experiments show that our model achieves $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
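A toy illustration of where a speedup like the one above comes from: an auto-regressive decoder needs one forward pass per output frame, while a non-autoregressive one emits all frames in a single pass. The model and frame count are made up for demonstration.

```python
import time
import torch
import torch.nn as nn

net = nn.Linear(64, 64)
frames, x = 240, torch.randn(1, 240, 64)   # e.g., 3 s of audio frames

t0 = time.perf_counter()
out = x[:, :1]
for _ in range(frames - 1):                # auto-regressive: sequential passes
    out = torch.cat([out, net(out[:, -1:])], dim=1)
ar_time = time.perf_counter() - t0

t0 = time.perf_counter()
_ = net(x)                                 # non-autoregressive: one pass
nar_time = time.perf_counter() - t0

print(f"AR {ar_time * 1e3:.1f} ms vs NAR {nar_time * 1e3:.1f} ms")
```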
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.