Noise2Music: Text-conditioned Music Generation with Diffusion Models
- URL: http://arxiv.org/abs/2302.03917v1
- Date: Wed, 8 Feb 2023 07:27:27 GMT
- Title: Noise2Music: Text-conditioned Music Generation with Diffusion Models
- Authors: Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly,
Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank,
Jesse Engel, Quoc V. Le, William Chan, Wei Han
- Abstract summary: We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio not only faithfully reflects key elements of the text prompt such as genre, tempo, instruments, mood, and era, but also grounds fine-grained semantics of the prompt.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
- Score: 73.74580231353684
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Noise2Music, where a series of diffusion models is trained to
generate high-quality 30-second music clips from text prompts. Two types of
diffusion models, a generator model, which generates an intermediate
representation conditioned on text, and a cascader model, which generates
high-fidelity audio conditioned on the intermediate representation and possibly
the text, are trained and utilized in succession to generate high-fidelity
music. We explore two options for the intermediate representation, one using a
spectrogram and the other using audio with lower fidelity. We find that the
generated audio is not only able to faithfully reflect key elements of the text
prompt such as genre, tempo, instruments, mood, and era, but goes beyond to
ground fine-grained semantics of the prompt. Pretrained large language models
play a key role in this story -- they are used to generate paired text for the
audio of the training set and to extract embeddings of the text prompts
ingested by the diffusion models.
Generated examples: https://google-research.github.io/noise2music
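The abstract describes a two-stage cascade: a generator diffusion model maps a text embedding to an intermediate representation (a spectrogram or low-fidelity audio), and a cascader diffusion model maps that intermediate, possibly together with the text, to high-fidelity audio. The sketch below only illustrates that control flow; the module sizes, the short sampling schedule, the channel-wise conditioning, and the 8x upsampling factor are illustrative assumptions, not the paper's architecture.

```python
# Illustrative two-stage cascade: generator -> low-fidelity waveform,
# cascader -> high-fidelity waveform conditioned on the upsampled intermediate.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDenoiser(nn.Module):
    """Stand-in epsilon-predictor; a real system would use a large (1D) U-Net."""
    def __init__(self, in_ch, out_ch, cond_dim):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, in_ch)
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv1d(64, out_ch, 3, padding=1),
        )

    def forward(self, x, t, text_emb):
        # Inject text conditioning; the timestep embedding is omitted in this toy.
        return self.net(x + self.cond_proj(text_emb).unsqueeze(-1))


def ddpm_sample(model, shape, text_emb, x_cond=None, steps=50, device="cpu"):
    """Plain ancestral DDPM sampling; x_cond is concatenated channel-wise."""
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas, alpha_bar = 1.0 - betas, torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn(shape, device=device)
    for t in reversed(range(steps)):
        inp = x if x_cond is None else torch.cat([x, x_cond], dim=1)
        t_batch = torch.full((shape[0],), t, device=device)
        eps = model(inp, t_batch, text_emb)
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x


if __name__ == "__main__":
    text_emb = torch.randn(1, 512)  # stand-in for a pretrained-LM text embedding
    generator = TinyDenoiser(in_ch=1, out_ch=1, cond_dim=512)
    cascader = TinyDenoiser(in_ch=2, out_ch=1, cond_dim=512)

    # Stage 1: sample a short low-fidelity intermediate waveform from text.
    low_fi = ddpm_sample(generator, (1, 1, 3_200), text_emb)
    # Stage 2: upsample it and use it as channel-wise conditioning for the cascader.
    up = F.interpolate(low_fi, scale_factor=8, mode="linear")
    high_fi = ddpm_sample(cascader, up.shape, text_emb, x_cond=up)
    print(high_fi.shape)  # torch.Size([1, 1, 25600])
```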
Related papers
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of such processes in the music and film industry.
We hypothesize that focusing on these aspects of audio generation could improve generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
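Tango 2 trains on these winner/loser pairs with Direct Preference Optimization. Below is a minimal, generic DPO-style preference loss for reference; the per-example log-likelihoods a diffusion model would supply, the beta value, and the exact objective used in the paper are not reproduced here, so treat this purely as an illustration of the preference idea.

```python
# Generic DPO-style preference loss on winner/loser log-likelihoods under the
# trainable model and a frozen reference model (illustrative, not Tango 2's
# exact diffusion objective).
import torch
import torch.nn.functional as F


def dpo_loss(logp_w, logp_l, logp_w_ref, logp_l_ref, beta=0.1):
    """logp_* are per-example log-likelihoods of the winner/loser audio."""
    margin = (logp_w - logp_w_ref) - (logp_l - logp_l_ref)
    return -F.logsigmoid(beta * margin).mean()


# Toy usage with random numbers standing in for model log-likelihoods.
logp_w, logp_l = torch.randn(8), torch.randn(8)
logp_w_ref, logp_l_ref = torch.randn(8), torch.randn(8)
print(dpo_loss(logp_w, logp_l, logp_w_ref, logp_l_ref))
```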
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
This paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation.
Our framework introduces a general representation of audio, called "language of audio" (LOA).
arXiv Detail & Related papers (2023-08-10T17:55:13Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
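A minimal sketch of the band-splitting idea behind multi-band diffusion: decompose the waveform into disjoint frequency bands so each band can be modeled by its own diffusion process, then sum the bands back together. The FFT masking and the band edges below are illustrative assumptions, not the filters used in the paper.

```python
# Split a waveform into frequency bands that sum back to the original signal.
import torch


def split_bands(wav, sample_rate, edges_hz=(0, 1500, 4000, 8000, None)):
    """Return a list of waveforms, one per band, that sum back to `wav`."""
    n = wav.shape[-1]
    freqs = torch.fft.rfftfreq(n, d=1.0 / sample_rate)
    spec = torch.fft.rfft(wav, dim=-1)
    bands = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        hi = freqs[-1] + 1 if hi is None else hi
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(torch.fft.irfft(spec * mask.to(spec.dtype), n=n, dim=-1))
    return bands


wav = torch.randn(1, 24_000)   # 1 s of audio at 24 kHz
bands = split_bands(wav, 24_000)
# Each band would be denoised by its own diffusion model conditioned on the
# low-bitrate tokens; here we just verify the decomposition is (near) lossless.
print(torch.allclose(sum(bands), wav, atol=1e-4))
```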
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- ArchiSound: Audio Generation with Diffusion [0.0]
In this work, we investigate the potential of diffusion models for audio generation.
We propose a new method for text-conditional latent audio diffusion with stacked 1D U-Nets.
For each model, we make an effort to maintain reasonable inference speed, targeting real-time on a single consumer GPU.
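A minimal 1D U-Net over raw waveform chunks, to give a concrete picture of what a text-conditional "stacked 1D U-Net" denoiser can look like. Depth, channel widths, the conditioning pathway, and the omitted diffusion-timestep embedding are all illustrative simplifications, not the ArchiSound design.

```python
# Toy text-conditional 1D U-Net: two downsampling stages, one skip connection.
import torch
import torch.nn as nn


class UNet1d(nn.Module):
    def __init__(self, channels=1, width=32, cond_dim=512):
        super().__init__()
        self.cond = nn.Linear(cond_dim, width)
        self.down1 = nn.Conv1d(channels, width, 4, stride=2, padding=1)
        self.down2 = nn.Conv1d(width, width * 2, 4, stride=2, padding=1)
        self.mid = nn.Conv1d(width * 2, width * 2, 3, padding=1)
        self.up1 = nn.ConvTranspose1d(width * 2, width, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose1d(width * 2, channels, 4, stride=2, padding=1)
        self.act = nn.SiLU()

    def forward(self, x, text_emb):
        d1 = self.act(self.down1(x))                 # (B, W, T/2)
        d1 = d1 + self.cond(text_emb).unsqueeze(-1)  # inject text conditioning
        d2 = self.act(self.down2(d1))                # (B, 2W, T/4)
        m = self.act(self.mid(d2))
        u1 = self.act(self.up1(m))                   # (B, W, T/2)
        u1 = torch.cat([u1, d1], dim=1)              # skip connection
        return self.up2(u1)                          # (B, C, T)


net = UNet1d()
noisy = torch.randn(2, 1, 2 ** 14)   # batch of noisy waveform chunks
text_emb = torch.randn(2, 512)       # stand-in text embedding
print(net(noisy, text_emb).shape)    # torch.Size([2, 1, 16384])
```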
arXiv Detail & Related papers (2023-01-30T20:23:26Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion [27.567536688166776]
We bridge text and music via a text-to-music generation model that is highly efficient, expressive, and can handle long-term structure.
Specifically, we develop Moûsai, a cascading two-stage latent diffusion model that can generate multiple minutes of high-quality stereo music at 48 kHz from textual descriptions.
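The efficiency of this kind of model comes from running diffusion in a heavily compressed latent space rather than on raw samples. The toy autoencoder below only illustrates the compression bookkeeping (an assumed 64x ratio on 48 kHz stereo); it is untrained and is not Moûsai's actual codec or diffusion stack.

```python
# Compress stereo audio into a short latent sequence where diffusion would run.
import torch
import torch.nn as nn

compress = 64  # waveform samples per latent frame (illustrative ratio)
encoder = nn.Conv1d(2, 32, kernel_size=compress, stride=compress)         # stereo in
decoder = nn.ConvTranspose1d(32, 2, kernel_size=compress, stride=compress)

wav = torch.randn(1, 2, 48_000 * 4)   # 4 s of stereo audio at 48 kHz
latent = encoder(wav)                 # (1, 32, 3000): the diffusion model works here
recon = decoder(latent)               # decoded back to the original waveform length
print(latent.shape, recon.shape)
```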
arXiv Detail & Related papers (2023-01-27T14:52:53Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
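A minimal sketch of the expert-ensemble idea: separate denoisers own disjoint ranges of the diffusion timestep (i.e., noise level), and exactly one runs per sampling step, so per-step inference cost matches a single model. The two-expert split at t = 500 and the toy convolutional "denoisers" are illustrative assumptions.

```python
# Route each sampling step to the expert responsible for that noise level.
import torch
import torch.nn as nn


class ExpertEnsemble(nn.Module):
    def __init__(self, experts, boundaries):
        """Each expert owns a contiguous timestep range; `boundaries` are the
        exclusive upper edges of those ranges, in increasing order."""
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.boundaries = boundaries  # e.g. [500, 1000] for two experts

    def forward(self, x, t):
        # Exactly one expert runs per step, so inference cost matches one model.
        for expert, upper in zip(self.experts, self.boundaries):
            if t < upper:
                return expert(x)
        return self.experts[-1](x)


# Toy denoisers standing in for full text-conditioned U-Nets.
low_noise_expert = nn.Conv2d(3, 3, 3, padding=1)   # late steps: fine details
high_noise_expert = nn.Conv2d(3, 3, 3, padding=1)  # early steps: coarse layout
ens = ExpertEnsemble([low_noise_expert, high_noise_expert], boundaries=[500, 1000])

x = torch.randn(1, 3, 64, 64)
print(ens(x, t=100).shape, ens(x, t=900).shape)
```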
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
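A minimal sketch of autoregressive generation over discrete audio tokens: a decoder-only transformer samples the next token given the tokens so far, seeded here with stand-in text tokens. Vocabulary size, model size, and the prompting scheme are illustrative assumptions, not AudioGen's configuration.

```python
# Toy decoder-only LM over discrete audio tokens with a sampling loop.
import torch
import torch.nn as nn


class TinyAudioLM(nn.Module):
    def __init__(self, vocab=1024, dim=128, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        n = tokens.shape[1]
        # Causal mask so each position only attends to earlier tokens.
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tokens), mask=causal)
        return self.head(h)  # next-token logits at every position


@torch.no_grad()
def generate(model, prompt_tokens, n_new=32):
    tokens = prompt_tokens
    for _ in range(n_new):
        logits = model(tokens)[:, -1]                   # logits for the next token
        nxt = torch.multinomial(logits.softmax(-1), 1)  # sample one audio token
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens  # a codec decoder would turn these tokens back into a waveform


model = TinyAudioLM()
text_tokens = torch.randint(0, 1024, (1, 8))  # stand-in for an encoded text prompt
print(generate(model, text_tokens).shape)     # torch.Size([1, 40])
```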
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis [13.263771543118994]
We propose a unified model for three inter-related tasks: 1) to separate individual sound sources from a mixed music audio, 2) to transcribe each sound source to MIDI notes, and 3) to synthesize new pieces based on the timbre of separated sources.
The model is inspired by the fact that when humans listen to music, our minds can not only separate the sounds of different instruments, but also simultaneously perceive high-level representations such as score and timbre.
arXiv Detail & Related papers (2021-08-07T14:28:21Z)