ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models
- URL: http://arxiv.org/abs/2302.04456v2
- Date: Thu, 21 Sep 2023 09:30:00 GMT
- Title: ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models
- Authors: Pengfei Zhu, Chao Pang, Yekun Chai, Lei Li, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu
- Abstract summary: This paper introduces a text-to-waveform music generation model, underpinned by the utilization of diffusion models.
Our methodology hinges on the innovative incorporation of free-form textual prompts as conditional factors to guide the waveform generation process.
We demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.
- Score: 67.66825818489406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the burgeoning interest in diffusion models has led to
significant advances in image and speech generation. Nevertheless, the direct
synthesis of music waveforms from unrestricted textual prompts remains a
relatively underexplored domain. In response to this lacuna, this paper
introduces a pioneering contribution in the form of a text-to-waveform music
generation model, underpinned by the utilization of diffusion models. Our
methodology hinges on the innovative incorporation of free-form textual prompts
as conditional factors to guide the waveform generation process within the
diffusion model framework. Addressing the challenge of limited text-music
parallel data, we undertake the creation of a dataset by harnessing web
resources, a task facilitated by weak supervision techniques. Furthermore, a
rigorous empirical inquiry is undertaken to contrast the efficacy of two
distinct prompt formats for text conditioning, namely, music tags and
unconstrained textual descriptions. The outcomes of this comparative analysis
affirm the superior performance of our proposed model in terms of enhancing
text-music relevance. Finally, our work culminates in a demonstrative
exhibition of the excellent capabilities of our model in text-to-music
generation. We further demonstrate that our generated music in the waveform
domain outperforms previous works by a large margin in terms of diversity,
quality, and text-music relevance.
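To make the overall pipeline concrete, the sketch below shows a toy DDPM-style reverse process in which a text embedding conditions a noise-prediction network operating directly on a raw waveform. It is a minimal illustration of text-conditioned waveform diffusion, not the ERNIE-Music architecture: the denoiser, embedding size, waveform length, and noise schedule are all assumptions made for the example.

```python
# Minimal sketch of text-conditioned diffusion sampling over a raw waveform.
# This is NOT the ERNIE-Music architecture; the tiny denoiser, embedding size,
# waveform length, and linear noise schedule are illustrative assumptions.
import torch
import torch.nn as nn

class ToyTextConditionedDenoiser(nn.Module):
    """Predicts the noise in a waveform given a timestep and a text embedding."""
    def __init__(self, wave_len=16000, text_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(wave_len + text_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, wave_len),
        )

    def forward(self, x_t, t, text_emb):
        # Concatenate waveform, normalized timestep, and text condition.
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([x_t, t_feat, text_emb], dim=-1))

@torch.no_grad()
def sample(model, text_emb, steps=1000, wave_len=16000):
    """DDPM-style ancestral sampling, conditioned on a text embedding."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(text_emb.size(0), wave_len)          # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((x.size(0),), t, dtype=torch.long)
        eps = model(x, t_batch, text_emb)                # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                             # generated waveform

# Usage: a random vector stands in for the output of a real text encoder.
model = ToyTextConditionedDenoiser()
text_emb = torch.randn(1, 128)                           # e.g. "calm piano melody"
waveform = sample(model, text_emb, steps=50)             # few steps, demo only
print(waveform.shape)                                    # torch.Size([1, 16000])
```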
Related papers
- ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance [11.207513771079705]
We introduce ExpGest, a novel framework leveraging synchronized text and audio information to generate expressive full-body gestures.
Unlike AdaIN or one-hot encoding methods, we design a noise emotion classifier for optimizing adversarial direction noise.
We show that ExpGest achieves more expressive, natural, and controllable global motion in speakers compared to state-of-the-art models.
arXiv Detail & Related papers (2024-10-12T07:01:17Z)
- The Interpretation Gap in Text-to-Music Generation Models [1.2565093324944228]
We propose a framework to describe the musical interaction process, which includes expression, interpretation, and execution of controls.
We argue that the primary gap between existing text-to-music models and musicians lies in the interpretation stage, where models lack the ability to interpret controls from musicians.
arXiv Detail & Related papers (2024-07-14T20:51:08Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models [24.582948932985726]
This paper introduces a novel approach to the editing of music generated by text-to-music models.
Our method transforms text editing into latent space manipulation while adding an extra constraint to enforce consistency.
Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations.
arXiv Detail & Related papers (2024-02-09T04:34:08Z)
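As a loose illustration of the latent-space manipulation idea described for MusicMagus above (not the paper's actual algorithm), the sketch below moves a generation latent along a direction derived from two text embeddings and applies a simple consistency constraint that pulls the edited latent back toward the original; every name and dimension here is a hypothetical stand-in.

```python
# Hedged sketch of text-driven latent editing with a consistency constraint.
# This is a conceptual toy, not MusicMagus itself: the latent size and the
# specific constraint are illustrative assumptions.
import torch

def edit_latent(z, src_text_emb, tgt_text_emb, strength=1.0, consistency=0.5):
    """Move a latent along the direction implied by a text edit.

    z            : (B, D) latent of the original generation
    src_text_emb : (B, D) embedding of the original prompt
    tgt_text_emb : (B, D) embedding of the edited prompt
    strength     : how far to move along the edit direction
    consistency  : 0 keeps the edit fully, 1 snaps back to the original latent
    """
    direction = tgt_text_emb - src_text_emb          # crude "edit direction"
    z_edit = z + strength * direction
    # Consistency constraint: interpolate back toward the original latent so
    # the edited sample keeps the structure of the source music.
    return (1.0 - consistency) * z_edit + consistency * z

# Usage with random stand-ins for a real text encoder and diffusion latent.
z = torch.randn(1, 256)
src = torch.randn(1, 256)   # e.g. embedding of "acoustic guitar ballad"
tgt = torch.randn(1, 256)   # e.g. embedding of "electric guitar ballad"
z_new = edit_latent(z, src, tgt, strength=0.8, consistency=0.3)
print(z_new.shape)          # torch.Size([1, 256])
```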
- StemGen: A music generation model that listens [9.489938613869864]
We present an alternative paradigm for producing music generation models that can listen and respond to musical context.
We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture.
The resulting model reaches the audio quality of state-of-the-art text-conditioned models, as well as exhibiting strong musical coherence with its context.
arXiv Detail & Related papers (2023-12-14T08:09:20Z)
- A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI [64.71397830291838]
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction.
With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement.
This work conducts a survey on audio diffusion models, which is complementary to existing surveys.
arXiv Detail & Related papers (2023-03-23T15:17:15Z)
- Text-to-image Diffusion Models in Generative AI: A Survey [86.11421833017693]
This survey reviews the progress of diffusion models in generating images from text.
We discuss applications beyond image generation, such as text-guided generation for various modalities like videos, and text-guided image editing.
arXiv Detail & Related papers (2023-03-14T13:49:54Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt, such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
arXiv Detail & Related papers (2023-02-08T07:27:27Z)
- Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion [27.567536688166776]
We bridge text and music via a text-to-music generation model that is highly efficient and expressive and can handle long-term structure.
Specifically, we develop Moûsai, a cascading two-stage latent diffusion model that can generate multiple minutes of high-quality stereo music at 48kHz from textual descriptions.
arXiv Detail & Related papers (2023-01-27T14:52:53Z)
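A bit of back-of-the-envelope arithmetic helps explain why Moûsai-style models run diffusion in a compressed latent space rather than on raw samples; the 64x compression factor below is an assumption for illustration, not a figure from the paper.

```python
# Rough arithmetic for diffusing in a compressed latent space instead of on
# raw audio samples. The 64x compression factor is an assumed example value.
sample_rate = 48_000          # Hz, stereo music as described in the summary
channels = 2
minutes = 2
raw_values = sample_rate * channels * 60 * minutes
print(f"raw waveform values for {minutes} min: {raw_values:,}")          # 11,520,000

compression = 64              # assumed latent downsampling factor
latent_values = raw_values // compression
print(f"latent values at {compression}x compression: {latent_values:,}")  # 180,000
```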
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
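The last entry evaluates generated symbolic scores with BLEU and edit-distance similarity. The sketch below computes a normalized Levenshtein similarity between two token sequences; the normalization and the made-up note tokens are assumptions, and the cited paper's exact metric may be defined differently.

```python
# Hedged sketch of an edit-distance similarity between symbolic music token
# sequences. The normalization (1 - distance / max_len) is a common choice,
# but the cited paper's exact formulation may differ.
def edit_distance_similarity(a, b):
    """Return a similarity in [0, 1] based on Levenshtein distance."""
    if not a and not b:
        return 1.0
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    distance = prev[-1]
    return 1.0 - distance / max(len(a), len(b))

# Usage with made-up note tokens standing in for generated vs. reference scores.
generated = ["C4", "E4", "G4", "C5", "rest"]
reference = ["C4", "E4", "G4", "B4", "C5"]
print(round(edit_distance_similarity(generated, reference), 3))
```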
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences arising from its use.