Fast Timing-Conditioned Latent Audio Diffusion
- URL: http://arxiv.org/abs/2402.04825v3
- Date: Mon, 13 May 2024 14:05:00 GMT
- Title: Fast Timing-Conditioned Latent Audio Diffusion
- Authors: Zach Evans, CJ Carr, Josiah Taylor, Scott H. Hawley, Jordi Pons
- Abstract summary: Stable Audio is capable of rendering stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU.
It is among the best performers on two public text-to-music and text-to-audio benchmarks and, unlike state-of-the-art models, can generate music with structure and stereo sounds.
- Score: 8.774733281142021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding. Further, most previous works do not address the fact that music and sound effects naturally vary in duration. Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz using text prompts with a generative model. Stable Audio is based on latent diffusion, with its latent defined by a fully-convolutional variational autoencoder. It is conditioned on text prompts as well as timing embeddings, allowing for fine control over both the content and length of the generated music and sounds. Stable Audio is capable of rendering stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU. Despite its compute efficiency and fast inference, it is among the best performers on two public text-to-music and text-to-audio benchmarks and, unlike state-of-the-art models, can generate music with structure and stereo sounds.
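The timing conditioning described in the abstract can be pictured with a short sketch. The snippet below is a hedged illustration only, not Stable Audio's implementation: the class name TimingConditioner, the MLP shapes, and the normalization by a 95-second maximum are assumptions inferred from the abstract.

```python
# A hedged sketch of timing conditioning, not Stable Audio's code.
# TimingConditioner, the MLP shapes, and the 95 s normalization are
# assumptions inferred from the abstract.
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Maps (start time, total length) in seconds to embedding tokens
    appended to the text tokens used for cross-attention."""

    def __init__(self, dim: int = 768, max_seconds: float = 95.0):
        super().__init__()
        self.max_seconds = max_seconds
        # One small MLP per scalar: normalized seconds -> one token.
        self.start_proj = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.total_proj = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, seconds_start, seconds_total, text_tokens):
        # Normalize to [0, 1] so the MLPs see a bounded range.
        s = (seconds_start / self.max_seconds).unsqueeze(-1)
        t = (seconds_total / self.max_seconds).unsqueeze(-1)
        timing = torch.stack(
            [self.start_proj(s), self.total_proj(t)], dim=1)
        # Append the two timing tokens: (B, N, dim) -> (B, N + 2, dim).
        return torch.cat([text_tokens, timing], dim=1)

cond = TimingConditioner()
text = torch.randn(2, 77, 768)  # dummy text encoder output
out = cond(torch.tensor([0.0, 10.0]), torch.tensor([95.0, 30.0]), text)
print(out.shape)  # torch.Size([2, 79, 768])
```

In a sketch like this, the timing tokens ride alongside the text tokens, so a diffusion backbone can attend to content and length cues through the same cross-attention mechanism.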
Related papers
- A2SB: Audio-to-Audio Schrodinger Bridges [52.637226461175096]
We present an audio restoration model for high-res music at 44.1kHz.
Our model, Audio-to-Audio Schrodinger Bridges (A2SB), is capable of both bandwidth extension and inpainting.
arXiv Detail & Related papers (2025-01-20T07:28:41Z)
- Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis [56.01110988816489]
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio.
MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples.
MMAudio achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance.
arXiv Detail & Related papers (2024-12-19T18:59:55Z)
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
In our method, we utilize a pre-trained text-to-audio (TTA) diffusion network as the audio generation agent, working in tandem with GPT-4.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Long-form music generation with latent diffusion [8.799402694043955]
We show that by training a generative model on long temporal contexts, it is possible to produce long-form music of up to 4m45s.
We obtain state-of-the-art generations according to metrics on audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
arXiv Detail & Related papers (2024-04-16T06:09:33Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generating audio from text prompts is an important part of production processes in the music and film industries.
We hypothesize that focusing on aspects such as the presence and temporal ordering of the events described in the prompt can improve audio generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
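The winner/loser construction pairs naturally with a direct-preference-optimization (DPO) objective. Below is a hedged sketch of the generic DPO loss on paired log-probabilities; the names dpo_loss and beta are invented here, and Tango 2 adapts DPO to the diffusion setting rather than using plain log-probabilities.

```python
# A hedged sketch of the generic DPO preference loss; dpo_loss and
# beta are illustrative names, not Tango 2's actual objective.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Push the policy to prefer winner over loser outputs, measured
    relative to a frozen reference model."""
    margin_w = logp_w - ref_logp_w  # implicit reward of winner audio
    margin_l = logp_l - ref_logp_l  # implicit reward of loser audio
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()

# Dummy per-sample log-probabilities for a batch of 4 preference pairs.
lw, ll, rw, rl = (torch.randn(4) for _ in range(4))
print(dpo_loss(lw, ll, rw, rl))
```

In diffusion adaptations of DPO, the per-sample log-probabilities are typically replaced by negative denoising losses, so the model is rewarded for denoising the winner audio better than the losers.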
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- AudioSR: Versatile Audio Super-resolution at Scale [32.36683443201372]
We introduce a diffusion-based generative model, AudioSR, that is capable of performing robust audio super-resolution on versatile audio types.
Specifically, AudioSR can upsample any input audio signal within the bandwidth range of 2kHz to 16kHz to a high-resolution audio signal at 24kHz bandwidth.
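As a point of reference, bandwidth and sample rate are tied by the Nyquist criterion (bandwidth = sample rate / 2), so a 24kHz output bandwidth implies 48kHz audio. The snippet below only illustrates that rate relationship with a plain resampler; it is not the AudioSR model, and the dummy signal is an assumption.

```python
# Hedged illustration of the bandwidth/sample-rate relationship only;
# this plain resampler is not the AudioSR model, and the dummy input
# is an assumption.
import torch
import torchaudio.functional as F

sr_in = 16_000    # 16 kHz audio carries at most 8 kHz of bandwidth
sr_out = 48_000   # a 24 kHz bandwidth target implies 48 kHz audio

wave = torch.randn(1, sr_in * 2)             # 2 s of dummy mono audio
upsampled = F.resample(wave, sr_in, sr_out)  # naive rate conversion
print(wave.shape, upsampled.shape)           # (1, 32000) -> (1, 96000)
```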
arXiv Detail & Related papers (2023-09-13T21:00:09Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- MP3net: coherent, minute-long music generation from raw audio with a simple convolutional GAN [0.0]
We present a deep convolutional GAN which produces high-quality audio samples with long-range coherence.
We leverage the auditory masking and psychoacoustic perception limit of the human ear to widen the true distribution.
We use MP3net to create 95s stereo tracks with a 22kHz sample rate after training for 250h on a single Cloud TPUv2.
arXiv Detail & Related papers (2021-01-12T22:37:21Z)