PTQ4ADM: Post-Training Quantization for Efficient Text Conditional Audio Diffusion Models
- URL: http://arxiv.org/abs/2409.13894v1
- Date: Fri, 20 Sep 2024 20:52:56 GMT
- Title: PTQ4ADM: Post-Training Quantization for Efficient Text Conditional Audio Diffusion Models
- Authors: Jayneel Vora, Aditya Krishnan, Nader Bouacida, Prabhu RV Shankar, Prasant Mohapatra
- Abstract summary: This work introduces PTQ4ADM, a novel framework for quantizing audio diffusion models (ADMs).
Our key contributions include (1) a coverage-driven prompt augmentation method and (2) an activation-aware calibration set generation algorithm for text-conditional ADMs.
Extensive experiments demonstrate PTQ4ADM's capability to reduce the model size by up to 70% while achieving synthesis quality metrics comparable to full-precision models.
- Score: 8.99127212785609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Denoising diffusion models have emerged as state-of-the-art in generative tasks across image, audio, and video domains, producing high-quality, diverse, and contextually relevant data. However, their broader adoption is limited by high computational costs and large memory footprints. Post-training quantization (PTQ) offers a promising approach to mitigate these challenges by reducing model complexity through low-bitwidth parameters. Yet, direct application of PTQ to diffusion models can degrade synthesis quality due to accumulated quantization noise across multiple denoising steps, particularly in conditional tasks like text-to-audio synthesis. This work introduces PTQ4ADM, a novel framework for quantizing audio diffusion models (ADMs). Our key contributions include (1) a coverage-driven prompt augmentation method and (2) an activation-aware calibration set generation algorithm for text-conditional ADMs. These techniques ensure comprehensive coverage of audio aspects and modalities while preserving synthesis fidelity. We validate our approach on the TANGO, Make-An-Audio, and AudioLDM models for text-conditional audio generation. Extensive experiments demonstrate PTQ4ADM's capability to reduce model size by up to 70% while achieving synthesis quality metrics comparable to full-precision models (<5% increase in FD scores). We show that specific layers in the backbone network can be quantized to 4-bit weights and 8-bit activations without significant quality loss. This work paves the way for more efficient deployment of ADMs in resource-constrained environments.
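As a rough illustration of the mixed-precision scheme the abstract describes (4-bit weights on selected backbone layers, 8-bit activations), here is a minimal PyTorch sketch; the layer selection, calibration loop, and all helper names are illustrative assumptions, not the authors' implementation:

```python
# Hedged sketch of W4A8 post-training quantization: fake-quantize the weights
# of chosen layers to 4 bits and calibrate an 8-bit activation range from data.
import torch
import torch.nn as nn

def fake_quant_weights_4bit(w: torch.Tensor) -> torch.Tensor:
    # Symmetric per-output-channel uniform quantization to 4 bits.
    qmax = 2 ** (4 - 1) - 1
    scale = w.abs().amax(dim=tuple(range(1, w.dim())), keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

class Act8Bit(nn.Module):
    """Per-tensor 8-bit activation fake-quantizer; range tracked on calibration data."""
    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.tensor(1e-8))
        self.calibrating = True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.calibrating:
            self.scale = torch.maximum(self.scale, x.abs().max() / 127.0)
            return x
        return (x / self.scale).round().clamp(-128, 127) * self.scale

def quantize_selected(model: nn.Module, layer_names: set) -> None:
    # Fake-quantize only a chosen subset of layers, per the W4A8 finding above.
    for name, mod in model.named_modules():
        if name in layer_names and isinstance(mod, (nn.Linear, nn.Conv1d, nn.Conv2d)):
            mod.weight.data = fake_quant_weights_4bit(mod.weight.data)
```

The calibration inputs that drive `Act8Bit` would, per the abstract, come from the coverage-driven prompt augmentation; here they are left abstract.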
Related papers
- Enhancing Generalization in Data-free Quantization via Mixup-class Prompting [8.107092196905157]
Post-training quantization (PTQ) improves efficiency but struggles with limited calibration data, especially under privacy constraints.
Data-free quantization (DFQ) mitigates this by generating synthetic images using generative models such as generative adversarial networks (GANs) and text-conditioned latent diffusion models (LDMs).
We propose mixup-class prompting, a mixup-based text prompting strategy that fuses multiple class labels at the text prompt level to generate diverse, robust synthetic data.
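A minimal sketch of the fused-label idea above; the prompt template and class names are invented for illustration:

```python
# Fuse several class labels into one text prompt, mixup-style, so a
# text-conditioned generator produces more diverse synthetic calibration data.
import random

def mixup_class_prompt(class_names, k=2):
    chosen = random.sample(class_names, k)
    return "a photo containing " + " and ".join(chosen)

classes = ["golden retriever", "fire truck", "espresso", "violin"]
print(mixup_class_prompt(classes))  # e.g. "a photo containing violin and espresso"
```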
arXiv Detail & Related papers (2025-07-29T16:00:20Z)
- Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models [8.589209709453026]
Quantization, particularly Post-Training Quantization (PTQ), offers an effective way to reduce model size and inference cost without retraining.
We present a benchmark of eight state-of-the-art (SOTA) PTQ methods applied to two leading edge-ASR model families, Whisper and Moonshine.
Our results characterize the trade-offs between efficiency and accuracy, demonstrating that even 3-bit quantization can succeed on high-capacity models.
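For reference, a generic asymmetric k-bit uniform quantizer of the kind such PTQ benchmarks build on; this is a didactic sketch, not any specific method from the paper:

```python
import torch

def fake_quant_kbit(x: torch.Tensor, k: int = 3) -> torch.Tensor:
    # Asymmetric uniform quantization: map the tensor range onto 2**k levels.
    qmax = 2 ** k - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / qmax
    zero_point = (-x.min() / scale).round()
    q = (x / scale + zero_point).round().clamp(0, qmax)
    return (q - zero_point) * scale  # dequantized ("fake-quant") values

w = torch.randn(256, 256)
err = (w - fake_quant_kbit(w, k=3)).abs().mean()
print(f"mean absolute error at 3 bits: {err:.4f}")
```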
arXiv Detail & Related papers (2025-07-10T16:00:27Z)
- FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching [29.12032530972612]
FLowHigh is a novel approach that integrates flow matching, a highly efficient generative modeling technique, into audio super-resolution.
The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates.
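A single-step flow-matching sampler can be read as one Euler step along the learned velocity field; a toy sketch of that reading (the tiny MLP stands in for the real super-resolution network, and all shapes are invented):

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    # Placeholder for the learned flow-matching velocity field v(x, t).
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t.expand(x.shape[0], 1)], dim=-1))

@torch.no_grad()
def sample_one_step(v: VelocityNet, x0: torch.Tensor) -> torch.Tensor:
    # x1 = x0 + (1 - 0) * v(x0, 0): one Euler step traverses the whole path.
    return x0 + v(x0, torch.zeros(1))

v = VelocityNet(dim=64)
x0 = torch.randn(8, 64)          # e.g. noise conditioned on the low-res input
print(sample_one_step(v, x0).shape)
```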
arXiv Detail & Related papers (2025-01-09T02:30:26Z)
- TCAQ-DM: Timestep-Channel Adaptive Quantization for Diffusion Models [49.65286242048452]
We propose a novel method dubbed Timestep-Channel Adaptive Quantization for Diffusion Models (TCAQ-DM).
The proposed method substantially outperforms the state-of-the-art approaches in most cases.
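The name suggests separate quantization parameters along both the timestep and channel axes; below is a hedged sketch of that general idea, not the TCAQ-DM algorithm itself:

```python
import torch

class TimestepChannelQuant:
    # Keep one per-channel 8-bit activation scale for every denoising timestep,
    # since activation ranges drift across both axes.
    def __init__(self, num_steps: int, num_channels: int, bits: int = 8):
        self.qmax = 2 ** (bits - 1) - 1
        self.scale = torch.full((num_steps, num_channels), 1e-8)

    def calibrate(self, t: int, x: torch.Tensor) -> None:
        # x: (batch, channels, ...) activations observed at timestep t.
        ch_max = x.abs().amax(dim=(0, *range(2, x.dim())))
        self.scale[t] = torch.maximum(self.scale[t], ch_max / self.qmax)

    def quantize(self, t: int, x: torch.Tensor) -> torch.Tensor:
        s = self.scale[t].view(1, -1, *([1] * (x.dim() - 2)))
        return (x / s).round().clamp(-self.qmax - 1, self.qmax) * s

q = TimestepChannelQuant(num_steps=50, num_channels=4)
x = torch.randn(2, 4, 32, 32)
q.calibrate(t=10, x=x)
print(q.quantize(10, x).shape)
```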
arXiv Detail & Related papers (2024-12-21T16:57:54Z) - PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution [87.89013794655207]
Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps.
We propose PassionSR, a novel post-training quantization approach with adaptive scale for one-step diffusion (OSD) image SR.
Our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR.
arXiv Detail & Related papers (2024-11-26T04:49:42Z) - Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the text-to-audio (T2A) model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z) - QNCD: Quantization Noise Correction for Diffusion Models [15.189069680672239]
Diffusion models have revolutionized image synthesis, setting new benchmarks in quality and creativity.
Post-training quantization presents a solution to accelerate sampling, albeit at the expense of sample quality.
We introduce a unified Quantization Noise Correction Scheme (QNCD) aimed at diminishing quantization noise throughout the sampling process.
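One simple instantiation of noise correction: estimate, on calibration latents, the mean per-timestep bias the quantized model introduces relative to the full-precision one, and subtract it while sampling. This sketches the spirit of the idea rather than QNCD itself; the lambda models are stand-ins:

```python
import torch

@torch.no_grad()
def estimate_bias(fp_model, q_model, calib_latents, timesteps):
    # Mean quantization noise per timestep, measured against the FP reference.
    return {t: torch.stack([q_model(x, t) - fp_model(x, t) for x in calib_latents]).mean(dim=0)
            for t in timesteps}

@torch.no_grad()
def corrected_eps(q_model, x, t, bias):
    return q_model(x, t) - bias[t]   # corrected noise prediction at step t

fp_model = lambda x, t: 0.9 * x            # stand-in full-precision predictor
q_model = lambda x, t: 0.9 * x + 0.05      # quantized predictor with a bias
calib = [torch.randn(4, 8) for _ in range(16)]
bias = estimate_bias(fp_model, q_model, calib, timesteps=[0, 1, 2])
x = torch.randn(4, 8)
print((corrected_eps(q_model, x, 1, bias) - fp_model(x, 1)).abs().max())  # ~0
```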
arXiv Detail & Related papers (2024-03-28T04:24:56Z) - EDA-DM: Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models [8.742501879586309]
Quantization can effectively reduce model complexity, and post-training quantization (PTQ) is highly promising for compressing and accelerating diffusion models.
Existing PTQ methods suffer from distribution mismatch issues at both the calibration sample level and the reconstruction output level.
We propose EDA-DM, a standardized PTQ method that efficiently addresses the above issues.
arXiv Detail & Related papers (2023-12-09)
- TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models [52.454274602380124]
Diffusion models heavily depend on the time-step t to achieve satisfactory multi-round denoising.
We propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block.
Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features.
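Finite set calibration exploits the fact that the time-step only takes values in a known finite set, so the time-embedding path can be calibrated exhaustively rather than from sampled data. A sketch of that observation; the sinusoidal embedding is the standard diffusion one, and the 8-bit range derivation is illustrative:

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    # Standard sinusoidal time-step embedding used by diffusion UNets.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

all_t = torch.arange(1000)             # every timestep the sampler can use
embs = timestep_embedding(all_t)       # (1000, 128): the entire finite set
scale = embs.abs().amax(dim=0) / 127.0 # exact per-channel 8-bit ranges
print(scale.shape)                     # no sampled calibration data required
```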
arXiv Detail & Related papers (2023-11-27T12:59:52Z)
- EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models [21.17675493267517]
Post-training quantization (PTQ) and quantization-aware training (QAT) are two main approaches to compress and accelerate diffusion models.
We introduce a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency.
Our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency.
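A hedged sketch of the recipe's spirit: train a small low-rank correction so a crudely quantized layer matches its full-precision teacher on random inputs, so no real data is needed. Scales, ranks, and the training loop are invented, and this is a single layer rather than a diffusion model:

```python
import torch
import torch.nn as nn

fp = nn.Linear(64, 64)                                     # full-precision teacher
with torch.no_grad():                                      # crude 4-bit fake-quant
    w_q = (fp.weight / 0.02).round().clamp(-8, 7) * 0.02

lora_a = nn.Parameter(torch.zeros(64, 4))                  # low-rank correction A @ B
lora_b = nn.Parameter(torch.randn(4, 64) * 0.01)
opt = torch.optim.Adam([lora_a, lora_b], lr=1e-3)

for step in range(200):
    x = torch.randn(32, 64)                                # synthetic, data-free inputs
    with torch.no_grad():
        target = fp(x)                                     # distill the FP teacher
    student = x @ (w_q + lora_a @ lora_b).T + fp.bias.detach()
    loss = (student - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final distillation loss: {loss.item():.6f}")
```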
arXiv Detail & Related papers (2023-10-05T02:51:53Z)
- Enhancing Quantised End-to-End ASR Models via Personalisation [12.971231464928806]
We propose a novel strategy of personalisation for a quantised model (PQM).
PQM uses a 4-bit NormalFloat Quantisation (NF4) approach for model quantisation and low-rank adaptation (LoRA) for SAT.
Experiments have been performed on the LibriSpeech and the TED-LIUM 3 corpora.
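One common way to realize NF4 quantization plus LoRA with the Hugging Face bitsandbytes/peft stack is sketched below; the model checkpoint and LoRA target modules are placeholders, and the paper's own ASR setup may well differ:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-small", quantization_config=bnb_cfg
)
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)   # only the LoRA adapters are trained
model.print_trainable_parameters()
```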
arXiv Detail & Related papers (2023-09-17T02:35:21Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
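The multi-band idea rests on a decomposition whose sub-bands sum back to the original signal, so each band can be modeled separately; a toy FFT-based split where the band count and boundaries are arbitrary:

```python
import torch

def split_bands(audio: torch.Tensor, n_bands: int = 4) -> torch.Tensor:
    # Carve the spectrum into disjoint sub-bands and invert each one.
    spec = torch.fft.rfft(audio)
    edges = torch.linspace(0, spec.shape[-1], n_bands + 1).long()
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = torch.zeros_like(spec)
        masked[..., lo:hi] = spec[..., lo:hi]
        bands.append(torch.fft.irfft(masked, n=audio.shape[-1]))
    return torch.stack(bands)                 # (n_bands, time)

x = torch.randn(16000)                        # 1 s of audio at 16 kHz
bands = split_bands(x)
print((bands.sum(dim=0) - x).abs().max())     # bands sum back to the input
```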
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - Q-Diffusion: Quantizing Diffusion Models [52.978047249670276]
Post-training quantization (PTQ) is considered a go-to compression method in other domains, yet it does not directly carry over to diffusion models.
We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture.
We show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance.
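The multi-timestep pipeline is exactly why calibration data must cover the whole denoising trajectory rather than a single step; a sketch of collecting intermediate latents at timesteps spread across the schedule (the update rule and noise predictor are stand-ins):

```python
import torch

@torch.no_grad()
def collect_calibration_set(model, num_steps=50, n_samples=8, picks=8):
    keep = set(torch.linspace(0, num_steps - 1, picks).long().tolist())
    calib = []
    x = torch.randn(n_samples, 4, 32, 32)     # initial latents
    for t in range(num_steps - 1, -1, -1):
        eps = model(x, t)
        x = x - 0.02 * eps                    # stand-in for the real update rule
        if t in keep:
            calib.append((t, x.clone()))      # latents from across the schedule
    return calib

model = lambda x, t: 0.1 * torch.randn_like(x)  # placeholder noise predictor
calib = collect_calibration_set(model)
print(sorted(t for t, _ in calib))
```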
arXiv Detail & Related papers (2023-02-08T19:38:59Z) - Adversarial Audio Synthesis with Complex-valued Polynomial Networks [60.231877895663956]
Time-frequency (TF) representations in audio have increasingly been modeled with real-valued networks.
We introduce complex-valued networks, called APOLLO, that integrate such complex-valued representations in a natural way.
APOLLO results in a 17.5% improvement over adversarial methods and 8.2% over state-of-the-art diffusion models on SC09 in audio generation.
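A minimal complex-valued building block of the kind such networks use, implementing (a+bi)(c+di) = (ac-bd) + (ad+bc)i with two real weight matrices; the STFT front end and layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class ComplexLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.wr = nn.Linear(d_in, d_out, bias=False)  # real part of the weight
        self.wi = nn.Linear(d_in, d_out, bias=False)  # imaginary part

    def forward(self, re, im):
        # (a+bi)(c+di) = (ac - bd) + (ad + bc)i
        return self.wr(re) - self.wi(im), self.wr(im) + self.wi(re)

spec = torch.stft(torch.randn(1, 4096), n_fft=512,
                  window=torch.hann_window(512), return_complex=True)
layer = ComplexLinear(spec.shape[-1], 64)             # mix along the time frames
out_re, out_im = layer(spec.real, spec.imag)
print(out_re.shape, out_im.shape)
```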
arXiv Detail & Related papers (2022-06-14T12:58:59Z)