Music Consistency Models
- URL: http://arxiv.org/abs/2404.13358v1
- Date: Sat, 20 Apr 2024 11:52:30 GMT
- Title: Music Consistency Models
- Authors: Zhengcong Fei, Mingyuan Fan, Junshi Huang
- Abstract summary: We present Music Consistency Models (MusicCM), which leverages the concept of consistency models to efficiently synthesize mel-spectrograms for music clips.
Building upon existing text-to-music diffusion models, the MusicCM model incorporates consistency distillation and adversarial discriminator training.
Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness.
- Score: 31.415900049111023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Consistency models have exhibited remarkable capabilities in facilitating efficient image/video generation, enabling synthesis with minimal sampling steps. This has proven advantageous in mitigating the computational burdens associated with diffusion models. Nevertheless, the application of consistency models in music generation remains largely unexplored. To address this gap, we present Music Consistency Models (MusicCM), which leverages the concept of consistency models to efficiently synthesize mel-spectrograms for music clips, maintaining high quality while minimizing the number of sampling steps. Building upon existing text-to-music diffusion models, the MusicCM model incorporates consistency distillation and adversarial discriminator training. Moreover, we find it beneficial to generate extended coherent music by incorporating multiple diffusion processes with shared constraints. Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness. Notably, MusicCM achieves seamless music synthesis with a mere four sampling steps, e.g., only one second of synthesis time per minute of music clip, showcasing its potential for real-time application.
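The abstract describes few-step generation via a distilled consistency function that maps a noisy mel-spectrogram directly to a clean estimate. The snippet below is a minimal sketch of generic multistep consistency sampling under that setup; consistency_fn, the text-embedding argument, and the noise schedule are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of multistep consistency sampling for a mel-spectrogram
# generator, in the spirit of MusicCM. `consistency_fn` is a placeholder for
# a distilled network f(x_t, sigma, text_emb) that maps a noisy input
# directly to a clean-sample estimate; it is NOT the authors' model.
import torch

def consistency_sample(consistency_fn, text_emb, shape,
                       sigmas=(80.0, 24.0, 5.0, 0.5), sigma_min=0.002):
    """Four network evaluations total, one per entry in `sigmas`."""
    device = text_emb.device
    x = torch.randn(shape, device=device) * sigmas[0]        # start from pure noise
    for i, sigma in enumerate(sigmas):
        sigma_t = torch.full((shape[0],), sigma, device=device)
        x0_est = consistency_fn(x, sigma_t, text_emb)         # one-shot clean estimate
        if i + 1 < len(sigmas):
            # Re-noise the estimate to the next, smaller noise level and repeat.
            next_sigma = sigmas[i + 1]
            x = x0_est + (next_sigma ** 2 - sigma_min ** 2) ** 0.5 * torch.randn_like(x0_est)
        else:
            x = x0_est
    return x  # predicted mel-spectrogram
```

In a text-to-music pipeline the returned mel-spectrogram would typically be passed to a separate vocoder to produce a waveform; the abstract does not specify which vocoder MusicCM uses.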
Related papers
- One Step Diffusion via Shortcut Models [109.72495454280627]
We introduce shortcut models, a family of generative models that use a single network and training phase to produce high-quality samples.
Shortcut models condition the network on the current noise level and also on the desired step size, allowing the model to skip ahead in the generation process.
Compared to distillation, shortcut models reduce complexity to a single network and training phase and additionally allow varying step budgets at inference time.
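As a rough illustration of the step-size conditioning described above, the sketch below shows how a single shortcut-style network could be queried with both the current time and the desired jump size, so the same weights support different step budgets at inference. shortcut_net and the flow-matching update convention are assumptions for illustration, not the paper's code.

```python
# Sketch of sampling with a shortcut-style model: the network receives both
# the current time t and the step size d it should jump, so one set of
# weights supports any step budget chosen at inference time. `shortcut_net`
# is a placeholder predicting a velocity-style update, not the paper's code.
import torch

@torch.no_grad()
def shortcut_sample(shortcut_net, cond, shape, num_steps=4, device="cpu"):
    x = torch.randn(shape, device=device)            # t = 0: pure noise
    d = 1.0 / num_steps                              # step size picked at inference
    for k in range(num_steps):
        t = torch.full((shape[0],), k * d, device=device)
        step = torch.full((shape[0],), d, device=device)
        v = shortcut_net(x, t, step, cond)           # update for a jump of size d
        x = x + d * v                                # take the shortcut step
    return x
```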
arXiv Detail & Related papers (2024-10-16T13:34:40Z)
- Efficient Fine-Grained Guidance for Diffusion-Based Symbolic Music Generation [14.156461396686248]
We introduce an efficient Fine-Grained Guidance (FGG) approach within diffusion models.
FGG guides the diffusion models to generate music that aligns more closely with the control and intent of expert composers.
This approach empowers diffusion models to excel in advanced applications such as improvisation and interactive music creation.
arXiv Detail & Related papers (2024-10-11T00:41:46Z)
- Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models [0.0]
"Diff-A-Riff" is a Latent Diffusion Model designed to generate high-quality instrumentals adaptable to any musical context.
It produces 48kHz pseudo-stereo audio while significantly reducing inference time and memory usage.
arXiv Detail & Related papers (2024-06-12T16:34:26Z)
- DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time.
We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control.
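One way to picture inference-time control of a pre-trained sampler is to treat the initial noise latent as a free variable and optimize it against a target-matching loss backpropagated through sampling. The sketch below illustrates that general idea; sample_fn, feature_loss, and the optimizer settings are placeholders, not DITTO's actual implementation.

```python
# Sketch of inference-time optimization of the initial noise latent: run the
# pre-trained sampler, score the output with a target-matching loss, and
# backpropagate through sampling to update the noise. `sample_fn` and
# `feature_loss` are placeholders, not DITTO's released code.
import torch

def optimize_initial_noise(sample_fn, feature_loss, target, shape,
                           steps=50, lr=0.05, device="cpu"):
    noise = torch.randn(shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        output = sample_fn(noise)                 # differentiable sampling pass
        loss = feature_loss(output, target)       # e.g. intensity or melody match
        loss.backward()                           # gradient flows back to the noise
        opt.step()
    return noise.detach()
```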
arXiv Detail & Related papers (2024-01-22T18:10:10Z)
- VideoLCM: Video Latent Consistency Model [52.3311704118393]
VideoLCM builds upon existing latent video diffusion models and incorporates consistency distillation techniques for training the latent consistency model.
VideoLCM achieves high-fidelity and smooth video synthesis with only four sampling steps, showcasing the potential for real-time synthesis.
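Consistency distillation, mentioned for both VideoLCM and MusicCM above, is typically trained by asking a student consistency function to agree with an EMA copy of itself across one teacher ODE step. The following is a minimal sketch of such a training objective; the network interfaces, the EDM-style noise scales, the (batch, bins, frames) tensor shape, and the MSE distance are illustrative assumptions.

```python
# Sketch of one consistency-distillation step: a frozen teacher takes a single
# ODE solver step from t_{n+1} to t_n, and the student is trained so that its
# clean-sample predictions at the two points agree. Networks are placeholders;
# t values act as noise scales and tensors are (batch, bins, frames).
import torch
import torch.nn.functional as F

def consistency_distillation_loss(student, ema_student, teacher_ode_step,
                                  x0, cond, t_next, t_cur):
    noise = torch.randn_like(x0)
    x_next = x0 + t_next.view(-1, 1, 1) * noise                # noisy sample at t_{n+1}
    with torch.no_grad():
        x_cur = teacher_ode_step(x_next, t_next, t_cur, cond)  # one teacher solver step
        target = ema_student(x_cur, t_cur, cond)               # EMA student at t_n
    pred = student(x_next, t_next, cond)                       # student at t_{n+1}
    return F.mse_loss(pred, target)                            # illustrative distance
```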
arXiv Detail & Related papers (2023-12-14T16:45:36Z)
- Fast Diffusion GAN Model for Symbolic Music Generation Controlled by Emotions [1.6004393678882072]
We propose a diffusion model combined with a Generative Adversarial Network to generate discrete symbolic music.
We first use a trained Variational Autoencoder to obtain embeddings of a symbolic music dataset with emotion labels.
Our results demonstrate the successful control of our diffusion model to generate symbolic music with a desired emotion.
arXiv Detail & Related papers (2023-10-21T15:35:43Z)
- Simultaneous Image-to-Zero and Zero-to-Noise: Diffusion Models with Analytical Image Attenuation [53.04220377034574]
We propose incorporating an analytical image attenuation process into the forward diffusion process for high-quality (un)conditioned image generation.
Our method represents the forward image-to-noise mapping as a simultaneous image-to-zero mapping and zero-to-noise mapping.
We have conducted experiments on unconditioned image generation, e.g., CIFAR-10 and CelebA-HQ-256, and image-conditioned downstream tasks such as super-resolution, saliency detection, edge detection, and image inpainting.
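For intuition, the forward mapping described above can be pictured as two simultaneous analytical processes: an attenuation term that drives the image toward zero and a complementary term that grows noise from zero. The sketch below uses a cosine schedule purely for illustration; the paper's exact attenuation function may differ.

```python
# Sketch of a forward process written as an analytical image-to-zero term
# plus a zero-to-noise term: alpha_t attenuates the image toward zero while
# sigma_t grows the noise from zero. The cosine schedule is illustrative only.
import math
import torch

def forward_attenuation(x0, t):
    """x0: (batch, C, H, W); t: (batch,) in [0, 1]. Returns (x_t, eps)."""
    alpha_t = torch.cos(0.5 * math.pi * t).view(-1, 1, 1, 1)  # 1 -> 0: image-to-zero
    sigma_t = torch.sin(0.5 * math.pi * t).view(-1, 1, 1, 1)  # 0 -> 1: zero-to-noise
    eps = torch.randn_like(x0)
    return alpha_t * x0 + sigma_t * eps, eps
```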
arXiv Detail & Related papers (2023-06-23T18:08:00Z)
- Taming Diffusion Models for Music-driven Conducting Motion Generation [1.0624606551524207]
This paper presents Diffusion-Conductor, a novel DDIM-based approach for music-driven conducting motion generation.
We propose a random masking strategy to improve the feature robustness, and use a pair of geometric loss functions to impose additional regularizations.
We also design several novel metrics, including Frechet Gesture Distance (FGD) and Beat Consistency Score (BC) for a more comprehensive evaluation of the generated motion.
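Fréchet-style metrics such as FGD are usually computed as the Fréchet distance between Gaussians fitted to feature vectors of real and generated samples, in the same spirit as FID. The sketch below shows that standard computation; the motion feature extractor itself is assumed to exist separately and is not part of this snippet.

```python
# Sketch of a Frechet-style distance (the usual recipe behind FID/FGD-type
# metrics): fit a Gaussian to real and generated feature vectors and compute
# the Frechet distance between the two Gaussians. The motion feature
# extractor is assumed to exist elsewhere.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """feats_*: (num_samples, feature_dim) arrays of encoder features."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):                  # drop tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```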
arXiv Detail & Related papers (2023-06-15T03:49:24Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff), based on an ordinary differential equation, that achieves both fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
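The patch-based processing mentioned above amounts to partitioning a long 1-D signal into fixed-length chunks before feeding them to the model. A minimal sketch, with the patch length and zero-padding strategy chosen purely for illustration, is given below.

```python
# Sketch of patch-based partitioning of a 1-D signal: split the waveform into
# fixed-length chunks (zero-padding the tail) and restore it afterwards.
# The patch length is an arbitrary illustrative choice.
import torch
import torch.nn.functional as F

def to_patches(wave, patch_len=64):
    """wave: (batch, samples) -> (batch, num_patches, patch_len)."""
    b, n = wave.shape
    pad = (-n) % patch_len                        # pad so length divides evenly
    return F.pad(wave, (0, pad)).view(b, -1, patch_len)

def from_patches(patches, orig_len):
    """Inverse of `to_patches`, trimming the padding."""
    return patches.reshape(patches.shape[0], -1)[:, :orig_len]
```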
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.