Related papers: DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

URL: http://arxiv.org/abs/2405.20289v1
Date: Thu, 30 May 2024 17:40:11 GMT
Title: DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Authors: Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan,
Abstract summary: Controllable music generation methods are critical for human-centered AI-based music creation. We propose Distilled Diffusion Inference-Time T -Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control.
Score: 43.61383132919089
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T -Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide-variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation over 10-20x, but simultaneously improves control adherence and generation quality all at once. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.

Related papers

Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization [83.65278205301576]
We propose to learn direct mappings from different noise levels to the optimal solution for a given instance, facilitating high-quality generation with minimal shots. This is achieved through an optimization consistency training protocol, which minimizes the difference among samples. Experiments on two popular tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), demonstrate the superiority of Fast T2T regarding both solution quality and efficiency.
arXiv Detail & Related papers (2025-02-05T07:13:43Z)
Diffusion Models as Optimizers for Efficient Planning in Offline RL [47.0835433289033]
Diffusion models have shown strong competitiveness in offline reinforcement learning tasks. We propose a faster autoregressive model to handle the generation of feasible trajectories. This allows us to achieve more efficient planning without sacrificing capability.
arXiv Detail & Related papers (2024-07-23T03:00:01Z)
DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a frame-work for controlling pre-trained text-to-music diffusion models at inference-time. We demonstrate a surprisingly wide-range of applications for music generation including inpainting, outpainting, and looping as well as intensity, melody, and musical structure control.
arXiv Detail & Related papers (2024-01-22T18:10:10Z)
AdaDiff: Adaptive Step Selection for Fast Diffusion Models [82.78899138400435]
We introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies. AdaDiff is optimized using a policy method to maximize a carefully designed reward function. We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves similar visual quality compared to the baseline.
arXiv Detail & Related papers (2023-11-24T11:20:38Z)
The Missing U for Efficient Diffusion Models [3.712196074875643]
Diffusion Probabilistic Models yield record-breaking performance in tasks such as image synthesis, video generation, and molecule design. Despite their capabilities, their efficiency, especially in the reverse process, remains a challenge due to slow convergence rates and high computational costs. We introduce an approach that leverages continuous dynamical systems to design a novel denoising network for diffusion models.
arXiv Detail & Related papers (2023-10-31T00:12:14Z)
Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner [84.97253871387028]
A diffusion model, which is formulated to produce an image using thousands of denoising steps, usually suffers from a slow inference speed. We propose a timestep aligner that helps find a more accurate integral direction for a particular interval at the minimum cost. Experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods.
arXiv Detail & Related papers (2023-10-14T02:19:07Z)
BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images. Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few. We present a novel technique called BOOT, that overcomes limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, on progressive fast diffusion model for high-quality text-to-speech. ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms. ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2022-07-13T17:45:43Z)
Learning to Efficiently Sample from Diffusion Probabilistic Models [49.58748345998702]
Denoising Diffusion Probabilistic Models (DDPMs) can yield high-fidelity samples and competitive log-likelihoods across a range of domains. We introduce an exact dynamic programming algorithm that finds the optimal discrete time schedules for any pre-trained DDPM.
arXiv Detail & Related papers (2021-06-07T17:15:07Z)
Bi-level Off-policy Reinforcement Learning for Volt/VAR Control Involving Continuous and Discrete Devices [2.079959811127612]
In Volt/Var control, both slow timescale discrete devices (STDDs) and fast timescale continuous devices (FTCDs) are involved. Traditional optimization methods are heavily based on accurate models of the system, but sometimes impractical because of their unaffordable effort on modelling. In this paper, a novel bi-level off-policy reinforcement learning (RL) algorithm is proposed to solve this problem in a model-free manner.
arXiv Detail & Related papers (2021-04-13T02:22:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.