Music Consistency Models
- URL: http://arxiv.org/abs/2404.13358v1
- Date: Sat, 20 Apr 2024 11:52:30 GMT
- Title: Music Consistency Models
- Authors: Zhengcong Fei, Mingyuan Fan, Junshi Huang
- Abstract summary: We present Music Consistency Models (MusicCM), which leverages the concept of consistency models to efficiently synthesize mel-spectrograms for music clips.
Building upon existing text-to-music diffusion models, the MusicCM model incorporates consistency distillation and adversarial discriminator training.
Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness.
- Score: 31.415900049111023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Consistency models have exhibited remarkable capabilities in facilitating efficient image/video generation, enabling synthesis with minimal sampling steps. This has proven advantageous in mitigating the computational burdens associated with diffusion models. Nevertheless, the application of consistency models in music generation remains largely unexplored. To address this gap, we present Music Consistency Models (MusicCM), which leverages the concept of consistency models to efficiently synthesize mel-spectrograms for music clips, maintaining high quality while minimizing the number of sampling steps. Building upon existing text-to-music diffusion models, the MusicCM model incorporates consistency distillation and adversarial discriminator training. Moreover, we find it beneficial to generate extended coherent music by incorporating multiple diffusion processes with shared constraints. Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness. Notably, MusicCM achieves seamless music synthesis with a mere four sampling steps, e.g., only one second of synthesis time per minute of music clip, showcasing its potential for real-time application.
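The abstract describes few-step generation via a distilled consistency function that maps a noisy mel-spectrogram directly to a clean estimate. The snippet below is a minimal sketch of generic multistep consistency sampling under that setup; consistency_fn, the text-embedding argument, and the noise schedule are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of multistep consistency sampling for a mel-spectrogram
# generator, in the spirit of MusicCM. `consistency_fn` is a placeholder for
# a distilled network f(x_t, sigma, text_emb) that maps a noisy input
# directly to a clean-sample estimate; it is NOT the authors' model.
import torch

def consistency_sample(consistency_fn, text_emb, shape,
                       sigmas=(80.0, 24.0, 5.0, 0.5), sigma_min=0.002):
    """Four network evaluations total, one per entry in `sigmas`."""
    device = text_emb.device
    x = torch.randn(shape, device=device) * sigmas[0]        # start from pure noise
    for i, sigma in enumerate(sigmas):
        sigma_t = torch.full((shape[0],), sigma, device=device)
        x0_est = consistency_fn(x, sigma_t, text_emb)         # one-shot clean estimate
        if i + 1 < len(sigmas):
            # Re-noise the estimate to the next, smaller noise level and repeat.
            next_sigma = sigmas[i + 1]
            x = x0_est + (next_sigma ** 2 - sigma_min ** 2) ** 0.5 * torch.randn_like(x0_est)
        else:
            x = x0_est
    return x  # predicted mel-spectrogram
```

In a text-to-music pipeline the returned mel-spectrogram would typically be passed to a separate vocoder to produce a waveform; the abstract does not specify which vocoder MusicCM uses.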
Related papers
- One Step Diffusion via Shortcut Models [109.72495454280627]
We introduce shortcut models, a family of generative models that use a single network and training phase to produce high-quality samples.
Shortcut models condition the network on the current noise level and also on the desired step size, allowing the model to skip ahead in the generation process.
Compared to distillation, shortcut models reduce complexity to a single network and training phase and additionally allow varying step budgets at inference time.
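As a rough illustration of the step-size conditioning described above, the sketch below shows how a single shortcut-style network could be queried with both the current time and the desired jump size, so the same weights support different step budgets at inference. shortcut_net and the flow-matching update convention are assumptions for illustration, not the paper's code.

```python
# Sketch of sampling with a shortcut-style model: the network receives both
# the current time t and the step size d it should jump, so one set of
# weights supports any step budget chosen at inference time. `shortcut_net`
# is a placeholder predicting a velocity-style update, not the paper's code.
import torch

@torch.no_grad()
def shortcut_sample(shortcut_net, cond, shape, num_steps=4, device="cpu"):
    x = torch.randn(shape, device=device)            # t = 0: pure noise
    d = 1.0 / num_steps                              # step size picked at inference
    for k in range(num_steps):
        t = torch.full((shape[0],), k * d, device=device)
        step = torch.full((shape[0],), d, device=device)
        v = shortcut_net(x, t, step, cond)           # update for a jump of size d
        x = x + d * v                                # take the shortcut step
    return x
```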
arXiv Detail & Related papers (2024-10-16T13:34:40Z)
- Efficient Fine-Grained Guidance for Diffusion-Based Symbolic Music Generation [14.156461396686248]
We introduce an efficient Fine-Grained Guidance (FGG) approach within diffusion models.
FGG guides the diffusion models to generate music that aligns more closely with the control and intent of expert composers.
This approach empowers diffusion models to excel in advanced applications such as improvisation and interactive music creation.
arXiv Detail & Related papers (2024-10-11T00:41:46Z)
- Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models [0.0]
"Diff-A-Riff" is a Latent Diffusion Model designed to generate high-quality instrumentals adaptable to any musical context.
It produces 48kHz pseudo-stereo audio while significantly reducing inference time and memory usage.
arXiv Detail & Related papers (2024-06-12T16:34:26Z)
- DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time.
We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control.
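One way to picture inference-time control of a pre-trained sampler is to treat the initial noise latent as a free variable and optimize it against a target-matching loss backpropagated through sampling. The sketch below illustrates that general idea; sample_fn, feature_loss, and the optimizer settings are placeholders, not DITTO's actual implementation.

```python
# Sketch of inference-time optimization of the initial noise latent: run the
# pre-trained sampler, score the output with a target-matching loss, and
# backpropagate through sampling to update the noise. `sample_fn` and
# `feature_loss` are placeholders, not DITTO's released code.
import torch

def optimize_initial_noise(sample_fn, feature_loss, target, shape,
                           steps=50, lr=0.05, device="cpu"):
    noise = torch.randn(shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        output = sample_fn(noise)                 # differentiable sampling pass
        loss = feature_loss(output, target)       # e.g. intensity or melody match
        loss.backward()                           # gradient flows back to the noise
        opt.step()
    return noise.detach()
```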
arXiv Detail & Related papers (2024-01-22T18:10:10Z)
- VideoLCM: Video Latent Consistency Model [52.3311704118393]
VideoLCM builds upon existing latent video diffusion models and incorporates consistency distillation techniques for training the latent consistency model.
VideoLCM achieves high-fidelity and smooth video synthesis with only four sampling steps, showcasing the potential for real-time synthesis.
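Consistency distillation, mentioned for both VideoLCM and MusicCM above, is typically trained by asking a student consistency function to agree with an EMA copy of itself across one teacher ODE step. The following is a minimal sketch of such a training objective; the network interfaces, the EDM-style noise scales, the (batch, bins, frames) tensor shape, and the MSE distance are illustrative assumptions.

```python
# Sketch of one consistency-distillation step: a frozen teacher takes a single
# ODE solver step from t_{n+1} to t_n, and the student is trained so that its
# clean-sample predictions at the two points agree. Networks are placeholders;
# t values act as noise scales and tensors are (batch, bins, frames).
import torch
import torch.nn.functional as F

def consistency_distillation_loss(student, ema_student, teacher_ode_step,
                                  x0, cond, t_next, t_cur):
    noise = torch.randn_like(x0)
    x_next = x0 + t_next.view(-1, 1, 1) * noise                # noisy sample at t_{n+1}
    with torch.no_grad():
        x_cur = teacher_ode_step(x_next, t_next, t_cur, cond)  # one teacher solver step
        target = ema_student(x_cur, t_cur, cond)               # EMA student at t_n
    pred = student(x_next, t_next, cond)                       # student at t_{n+1}
    return F.mse_loss(pred, target)                            # illustrative distance
```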
arXiv Detail & Related papers (2023-12-14T16:45:36Z)
- Fast Diffusion GAN Model for Symbolic Music Generation Controlled by Emotions [1.6004393678882072]
We propose a diffusion model combined with a Generative Adversarial Network to generate discrete symbolic music.
We first use a trained Variational Autoencoder to obtain embeddings of a symbolic music dataset with emotion labels.
Our results demonstrate the successful control of our diffusion model to generate symbolic music with a desired emotion.
arXiv Detail & Related papers (2023-10-21T15:35:43Z)
- Simultaneous Image-to-Zero and Zero-to-Noise: Diffusion Models with Analytical Image Attenuation [53.04220377034574]
We propose incorporating an analytical image attenuation process into the forward diffusion process for high-quality (un)conditioned image generation.
Our method represents the forward image-to-noise mapping as a simultaneous image-to-zero mapping and zero-to-noise mapping.
We have conducted experiments on unconditioned image generation, e.g., CIFAR-10 and CelebA-HQ-256, and image-conditioned downstream tasks such as super-resolution, saliency detection, edge detection, and image inpainting.
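For intuition, the forward mapping described above can be pictured as two simultaneous analytical processes: an attenuation term that drives the image toward zero and a complementary term that grows noise from zero. The sketch below uses a cosine schedule purely for illustration; the paper's exact attenuation function may differ.

```python
# Sketch of a forward process written as an analytical image-to-zero term
# plus a zero-to-noise term: alpha_t attenuates the image toward zero while
# sigma_t grows the noise from zero. The cosine schedule is illustrative only.
import math
import torch

def forward_attenuation(x0, t):
    """x0: (batch, C, H, W); t: (batch,) in [0, 1]. Returns (x_t, eps)."""
    alpha_t = torch.cos(0.5 * math.pi * t).view(-1, 1, 1, 1)  # 1 -> 0: image-to-zero
    sigma_t = torch.sin(0.5 * math.pi * t).view(-1, 1, 1, 1)  # 0 -> 1: zero-to-noise
    eps = torch.randn_like(x0)
    return alpha_t * x0 + sigma_t * eps, eps
```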
arXiv Detail & Related papers (2023-06-23T18:08:00Z)
- Taming Diffusion Models for Music-driven Conducting Motion Generation [1.0624606551524207]
This paper presents Diffusion-Conductor, a novel DDIM-based approach for music-driven conducting motion generation.
We propose a random masking strategy to improve the feature robustness, and use a pair of geometric loss functions to impose additional regularizations.
We also design several novel metrics, including Frechet Gesture Distance (FGD) and Beat Consistency Score (BC) for a more comprehensive evaluation of the generated motion.
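Fréchet-style metrics such as FGD are usually computed as the Fréchet distance between Gaussians fitted to feature vectors of real and generated samples, in the same spirit as FID. The sketch below shows that standard computation; the motion feature extractor itself is assumed to exist separately and is not part of this snippet.

```python
# Sketch of a Frechet-style distance (the usual recipe behind FID/FGD-type
# metrics): fit a Gaussian to real and generated feature vectors and compute
# the Frechet distance between the two Gaussians. The motion feature
# extractor is assumed to exist elsewhere.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """feats_*: (num_samples, feature_dim) arrays of encoder features."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):                  # drop tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```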
arXiv Detail & Related papers (2023-06-15T03:49:24Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff), based on an ordinary differential equation, that achieves both fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
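The patch-based processing mentioned above amounts to partitioning a long 1-D signal into fixed-length chunks before feeding them to the model. A minimal sketch, with the patch length and zero-padding strategy chosen purely for illustration, is given below.

```python
# Sketch of patch-based partitioning of a 1-D signal: split the waveform into
# fixed-length chunks (zero-padding the tail) and restore it afterwards.
# The patch length is an arbitrary illustrative choice.
import torch
import torch.nn.functional as F

def to_patches(wave, patch_len=64):
    """wave: (batch, samples) -> (batch, num_patches, patch_len)."""
    b, n = wave.shape
    pad = (-n) % patch_len                        # pad so length divides evenly
    return F.pad(wave, (0, pad)).view(b, -1, patch_len)

def from_patches(patches, orig_len):
    """Inverse of `to_patches`, trimming the padding."""
    return patches.reshape(patches.shape[0], -1)[:, :orig_len]
```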
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.