Presto! Distilling Steps and Layers for Accelerating Music Generation
- URL: http://arxiv.org/abs/2410.05167v2
- Date: Wed, 16 Apr 2025 17:37:06 GMT
- Title: Presto! Distilling Steps and Layers for Accelerating Music Generation
- Authors: Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan,
- Abstract summary: Presto! is an approach to inference acceleration for score-based diffusion transformers. We develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models. To reduce the cost per step, we develop a simple, but powerful improvement to a recent layer distillation method.
- Score: 49.34961693154768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple, but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yield best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32 second mono/stereo 44.1kHz, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.
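The abstract's core idea, distribution matching distillation, trains a few-step generator by nudging its samples along the difference between the fake and real score functions. The following is a minimal 1-D sketch of that gradient, not the paper's implementation: it assumes a hypothetical teacher distribution N(2, 1) and a one-step generator x = theta + z, both with analytically known scores, so the DMD update can be written in closed form.

```python
import numpy as np

# Toy sketch of distribution matching distillation (DMD) in 1-D.
# Assumed setup (illustrative, not from the paper): the teacher's "real"
# distribution is N(mu_real, 1) with a known score, and the one-step
# generator is x = theta + z with z ~ N(0, 1), so the "fake" distribution
# is N(theta, 1), also with a known score.

rng = np.random.default_rng(0)
mu_real = 2.0

def score_real(x):
    # d/dx log N(x; mu_real, 1)
    return mu_real - x

def score_fake(x, theta):
    # d/dx log N(x; theta, 1)
    return theta - x

theta, lr = -3.0, 0.5
for _ in range(100):
    z = rng.standard_normal(256)
    x = theta + z                                   # one-step generation
    grad_x = score_fake(x, theta) - score_real(x)   # gradient of KL(fake || real) w.r.t. x
    theta -= lr * grad_x.mean()                     # dx/dtheta = 1, so the chain rule is trivial

print(round(theta, 3))  # theta converges toward mu_real = 2.0
```

In practice both scores are neural networks (the frozen teacher and an online "fake" score estimator retrained on generator outputs), but the update on the generator's samples has this same form.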
Related papers
- Transition Matching Distillation for Fast Video Generation [63.1049790376783]
We present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. TMD matches the multi-step denoising trajectory of a diffusion model with a few-step probability transition process. TMD provides a flexible and strong trade-off between generation speed and visual quality.
arXiv Detail & Related papers (2026-01-14T21:30:03Z) - Distribution Matching Distillation Meets Reinforcement Learning [30.960105413888943]
We propose DMDR, a novel framework that combines Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization compared to the traditional ones. Experiments demonstrate that DMDR can achieve leading visual quality, prompt coherence among few-step methods, and even exhibit performance that exceeds the multi-step teacher.
arXiv Detail & Related papers (2025-11-17T17:59:54Z) - Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis [65.77083310980896]
We propose Adversarial Distribution Matching (ADM) to align latent predictions between real and fake score estimators for score distillation. Our proposed method achieves superior one-step performance on SDXL compared to DMD2 while consuming less GPU time. Additional experiments that apply multi-step ADM distillation on SD3-Medium, SD3.5-Large, and CogVideoX set a new benchmark towards efficient image and video synthesis.
arXiv Detail & Related papers (2025-07-24T16:45:05Z) - Learning Few-Step Diffusion Models by Trajectory Distribution Matching [18.229753357571116]
Trajectory Distribution Matching (TDM) is a unified distillation paradigm that combines the strengths of distribution and trajectory matching.
We develop a sampling-steps-aware objective that decouples learning targets across different steps, enabling more adjustable sampling.
Our model, TDM, outperforms existing methods on various backbones, delivering superior quality and significantly reduced training costs.
arXiv Detail & Related papers (2025-03-09T15:53:49Z) - FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation [55.424665700339695]
Diffusion-based audio-driven talking avatar methods have recently gained attention for their high-fidelity, vivid, and expressive results.
Despite the development of various distillation techniques for diffusion models, we found that naive diffusion distillation methods do not yield satisfactory results.
We propose FADA (Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation) to address this problem.
arXiv Detail & Related papers (2024-12-22T08:19:22Z) - Distillation-Free One-Step Diffusion for Real-World Image Super-Resolution [81.81748032199813]
We propose a Distillation-Free One-Step Diffusion model.
Specifically, we propose a noise-aware discriminator (NAD) to participate in adversarial training.
We improve the perceptual loss with edge-aware DISTS (EA-DISTS) to enhance the model's ability to generate fine details.
arXiv Detail & Related papers (2024-10-05T16:41:36Z) - One Step Diffusion-based Super-Resolution with Time-Aware Distillation [60.262651082672235]
Diffusion-based image super-resolution (SR) methods have shown promise in reconstructing high-resolution images with fine details from low-resolution counterparts.
Recent techniques have been devised to enhance the sampling efficiency of diffusion-based SR models via knowledge distillation.
We propose a time-aware diffusion distillation method, named TAD-SR, to accomplish effective and efficient image super-resolution.
arXiv Detail & Related papers (2024-08-14T11:47:22Z) - EM Distillation for One-step Diffusion Models [65.57766773137068]
We propose a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of quality.
We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilizes the distillation process.
arXiv Detail & Related papers (2024-05-27T05:55:22Z) - Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis [20.2271205957037]
Hyper-SD is a novel framework that amalgamates the advantages of ODE Trajectory Preservation and Reformulation.
We introduce Trajectory Segmented Consistency Distillation to progressively perform consistent distillation within pre-defined time-step segments.
We incorporate human feedback learning to boost the performance of the model in a low-step regime and mitigate the performance loss incurred by the distillation process.
arXiv Detail & Related papers (2024-04-21T15:16:05Z) - Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation [61.03530321578825]
We introduce Score identity Distillation (SiD), an innovative data-free method that distills the generative capabilities of pretrained diffusion models into a single-step generator.
SiD not only facilitates an exponentially fast reduction in Fréchet inception distance (FID) during distillation but also approaches or even exceeds the FID performance of the original teacher diffusion models.
arXiv Detail & Related papers (2024-04-05T12:30:19Z) - Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation [24.236841051249243]
Distillation methods aim to shift the model from many-shot to single-step inference.
We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD.
In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models.
arXiv Detail & Related papers (2024-03-18T17:51:43Z) - SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation [74.32186107058382]
We propose Stochastic Consistency Distillation (SCott) to enable accelerated text-to-image generation.
SCott distills the ordinary differential equation solvers-based sampling process of a pre-trained teacher model into a student.
On the MSCOCO-2017 5K dataset with a Stable Diffusion-V1.5 teacher, SCott achieves an FID of 21.9 with 2 sampling steps, surpassing that of the 1-step InstaFlow (23.4) and the 4-step UFOGen (22.1).
arXiv Detail & Related papers (2024-03-03T13:08:32Z) - One-Step Diffusion Distillation via Deep Equilibrium Models [64.11782639697883]
We introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image.
Our method enables fully offline training with just noise/image pairs from the diffusion model.
We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores.
arXiv Detail & Related papers (2023-12-12T07:28:40Z) - Adversarial Diffusion Distillation [18.87099764514747]
Adversarial Diffusion Distillation (ADD) is a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1-4 steps.
We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal.
Our model clearly outperforms existing few-step methods in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps.
arXiv Detail & Related papers (2023-11-28T18:53:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.