ArchiSound: Audio Generation with Diffusion
- URL: http://arxiv.org/abs/2301.13267v1
- Date: Mon, 30 Jan 2023 20:23:26 GMT
- Title: ArchiSound: Audio Generation with Diffusion
- Authors: Flavio Schneider
- Abstract summary: In this work, we investigate the potential of diffusion models for audio generation.
We propose a new method for text-conditional latent audio diffusion with stacked 1D U-Nets.
For each model, we make an effort to maintain reasonable inference speed, targeting real-time on a single consumer GPU.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media generation. One area that has yet to be fully explored is the application of diffusion models to audio generation. Audio generation requires an understanding of multiple aspects, such as the temporal dimension, long-term structure, multiple layers of overlapping sounds, and the nuances that only trained listeners can detect. In this work, we investigate the potential of diffusion models for audio generation. We propose a set of models to tackle multiple aspects, including a new method for text-conditional latent audio diffusion with stacked 1D U-Nets, that can generate multiple minutes of music from a textual description. For each model, we make an effort to maintain reasonable inference speed, targeting real-time on a single consumer GPU. In addition to trained models, we provide a collection of open-source libraries with the hope of simplifying future work in the field. Samples can be found at https://bit.ly/audio-diffusion. Code is available at https://github.com/archinetai/audio-diffusion-pytorch.
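As a rough illustration of the core idea behind such models (a 1D denoising network over a latent audio sequence, conditioned on a text embedding), the sketch below shows a toy text-conditional diffusion training step in PyTorch. It is not the ArchiSound architecture or the audio-diffusion-pytorch API; the network, the v-objective noise schedule, and all sizes are simplified placeholders.

```python
# Minimal sketch of text-conditional latent audio diffusion with a 1D U-Net.
# Illustrative only: not the ArchiSound architecture or the
# audio-diffusion-pytorch API; all names, sizes and the schedule are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet1d(nn.Module):
    """A toy 1D U-Net: one downsampling and one upsampling stage,
    conditioned on the diffusion time step and a pooled text embedding."""
    def __init__(self, channels=64, text_dim=512):
        super().__init__()
        self.cond = nn.Linear(text_dim + 1, channels)        # time + text -> FiLM-style bias
        self.down = nn.Conv1d(channels, channels * 2, 4, stride=2, padding=1)
        self.mid = nn.Conv1d(channels * 2, channels * 2, 3, padding=1)
        self.up = nn.ConvTranspose1d(channels * 2, channels, 4, stride=2, padding=1)
        self.out = nn.Conv1d(channels, channels, 3, padding=1)

    def forward(self, x, t, text_emb):
        # x: (B, C, T) latent sequence, t: (B,) times in [0, 1], text_emb: (B, text_dim).
        bias = self.cond(torch.cat([text_emb, t[:, None]], dim=-1))[:, :, None]
        h = F.silu(x + bias)
        d = F.silu(self.down(h))
        d = F.silu(self.mid(d))
        u = F.silu(self.up(d))
        return self.out(u + h)                               # skip connection

def diffusion_training_step(model, z0, text_emb):
    """One diffusion training step on latents z0 (B, C, T), using the
    v-objective (velocity prediction) as one common choice of target."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device)                      # random times in [0, 1]
    alpha, sigma = torch.cos(t * torch.pi / 2), torch.sin(t * torch.pi / 2)
    alpha, sigma = alpha[:, None, None], sigma[:, None, None]
    noise = torch.randn_like(z0)
    z_t = alpha * z0 + sigma * noise                         # noised latents
    v_target = alpha * noise - sigma * z0                    # velocity target
    v_pred = model(z_t, t, text_emb)
    return F.mse_loss(v_pred, v_target)

model = TinyUNet1d()
loss = diffusion_training_step(model, torch.randn(2, 64, 256), torch.randn(2, 512))
loss.backward()
```

In a latent diffusion setup, z0 would come from a pretrained audio autoencoder and text_emb from a pretrained text encoder; generation then runs the denoiser iteratively starting from pure noise.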
Related papers
- Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
We propose a novel video-and-text-to-sound generation method called ReWaS.
Our method estimates the structural information of audio (its energy) from the video while receiving key content cues from a user prompt.
By separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences.
arXiv Detail & Related papers (2024-07-08T01:59:17Z) - Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation [24.349512234085644]
- Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation [24.349512234085644]
This paper presents a simple and lightweight generative transformer, an approach that has not been fully investigated in multi-modal generation.
The transformer operates in the discrete audio and visual Vector-Quantized GAN space, and is trained in the mask denoising manner.
In the experiments, we show that our simple method surpasses recent image2audio generation methods.
arXiv Detail & Related papers (2024-05-23T14:13:16Z) - SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models [21.669044026456557]
- SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models [21.669044026456557]
We propose a method to enable audio conditioning in large-scale image diffusion models.
In addition to audio-conditioned image generation, our method can also be used in conjunction with diffusion-based editing methods.
arXiv Detail & Related papers (2024-05-01T21:43:57Z) - Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important part of content creation in the music and film industry.
We hypothesize that focusing on aspects such as the presence of concepts or events and their temporal ordering in the generated audio can improve audio generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z) - Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
Latent Aligners [69.70590867769408]
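Tango 2 aligns a text-to-audio diffusion model with Direct Preference Optimization over such winner/loser pairs. The sketch below shows a DPO-style preference loss for a diffusion model in that spirit: preference is scored by how much better the trained policy denoises the winner than a frozen reference does, relative to the loser. This follows the general Diffusion-DPO recipe rather than Tango 2's exact objective; all shapes, the beta value, and the epsilon-prediction assumption are placeholders.

```python
# DPO-style preference loss for a diffusion model: each prompt has a "winner"
# and a "loser" audio, and the policy is pushed to denoise the winner better
# than a frozen reference does, relative to the loser. Illustrative only.
import torch
import torch.nn.functional as F

def denoise_error(model, z0, t, noise, cond):
    """Per-sample denoising MSE at noise level t (epsilon-prediction assumed)."""
    alpha, sigma = (1 - t).sqrt()[:, None, None], t.sqrt()[:, None, None]
    z_t = alpha * z0 + sigma * noise
    eps_pred = model(z_t, t, cond)
    return ((eps_pred - noise) ** 2).flatten(1).mean(dim=1)

def diffusion_dpo_loss(policy, reference, z_win, z_lose, cond, beta=2000.0):
    b = z_win.shape[0]
    t = torch.rand(b, device=z_win.device)
    noise = torch.randn_like(z_win)
    # How much better/worse the policy denoises than the frozen reference,
    # evaluated separately on the preferred (winner) and rejected (loser) audio.
    win_gap = denoise_error(policy, z_win, t, noise, cond) \
            - denoise_error(reference, z_win, t, noise, cond).detach()
    lose_gap = denoise_error(policy, z_lose, t, noise, cond) \
             - denoise_error(reference, z_lose, t, noise, cond).detach()
    # Lower error on winners and higher error on losers (vs. the reference) is rewarded.
    return -F.logsigmoid(-beta * (win_gap - lose_gap)).mean()

# Toy usage with a stand-in denoiser (a real policy would be a text-conditional U-Net).
dummy = lambda z_t, t, cond: torch.zeros_like(z_t)
loss = diffusion_dpo_loss(dummy, dummy, torch.randn(2, 8, 64), torch.randn(2, 8, 64), cond=None)
```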
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - Audiobox: Unified Audio Generation with Natural Language Prompts [37.39834044113061]
This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities.
We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms.
Audiobox sets new benchmarks on speech and sound generation and unlocks new methods for generating audio with novel vocal and acoustic styles.
arXiv Detail & Related papers (2023-12-25T22:24:49Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion
Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z) - Mo\^usai: Text-to-Music Generation with Long-Context Latent Diffusion [27.567536688166776]
We bridge text and music via a text-to-music generation model that is highly efficient, expressive, and can handle long-term structure.
Specifically, we develop Moûsai, a cascading two-stage latent diffusion model that can generate multiple minutes of high-quality stereo music at 48 kHz from textual descriptions.
arXiv Detail & Related papers (2023-01-27T14:52:53Z) - MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and
Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
By design, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.