Q&C: When Quantization Meets Cache in Efficient Image Generation
- URL: http://arxiv.org/abs/2503.02508v1
- Date: Tue, 04 Mar 2025 11:19:02 GMT
- Title: Q&C: When Quantization Meets Cache in Efficient Image Generation
- Authors: Xin Ding, Xin Li, Haotong Qin, Zhibo Chen
- Abstract summary: We find that the combination of quantization and cache mechanisms for Diffusion Transformers (DiTs) is not straightforward. We propose a hybrid acceleration method by tackling the above challenges. Our method has accelerated DiTs by 12.7x while preserving competitive generation capability.
- Score: 24.783679431414686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization and cache mechanisms are typically applied individually for efficient Diffusion Transformers (DiTs), each demonstrating notable potential for acceleration. However, the benefit of combining the two mechanisms for efficient generation remains under-explored. Through empirical investigation, we find that combining quantization and caching for DiTs is not straightforward, and two key challenges lead to severe catastrophic performance degradation: (i) the sample efficacy of calibration datasets in post-training quantization (PTQ) is largely nullified by the cache operation; (ii) the combination of the two mechanisms introduces more severe exposure bias within the sampling distribution, resulting in amplified error accumulation during image generation. In this work, we take advantage of these two acceleration mechanisms and propose a hybrid acceleration method that tackles the above challenges, aiming to further improve the efficiency of DiTs while maintaining excellent generation capability. Concretely, a temporal-aware parallel clustering (TAP) scheme is designed to dynamically improve the sample selection efficacy of PTQ calibration across different diffusion steps. A variance compensation (VC) strategy is derived to correct the sampling distribution; it mitigates exposure bias by generating adaptive correction factors. Extensive experiments show that our method accelerates DiTs by 12.7x while preserving competitive generation capability. The code will be available at https://github.com/xinding-sys/Quant-Cache.
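As a rough illustration of the variance compensation idea, the sketch below rescales quantized-and-cached latents so their per-sample standard deviation matches a full-precision reference statistic. This is only a toy stand-in under my own assumptions; the paper's VC derives adaptive correction factors analytically, and the function name and reference value here are placeholders.

```python
import torch

def variance_compensation(x_t: torch.Tensor, ref_std: float, eps: float = 1e-8) -> torch.Tensor:
    """Rescale each sample so its std matches a reference std.

    Toy stand-in for variance compensation (VC): we simply match second
    moments against a full-precision reference statistic.
    """
    dims = tuple(range(1, x_t.ndim))                # all non-batch dimensions
    cur_std = x_t.std(dim=dims, keepdim=True)
    correction = ref_std / (cur_std + eps)          # adaptive per-sample factor
    return x_t * correction

if __name__ == "__main__":
    # latents from a quantized+cached DiT step (random here for the demo),
    # with variance shrunk to mimic accumulated quantization/cache error
    x = torch.randn(4, 4, 32, 32) * 0.8
    x_corrected = variance_compensation(x, ref_std=1.0)
    print(x.std().item(), x_corrected.std().item())
```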
Related papers
- CacheQuant: Comprehensively Accelerated Diffusion Models [3.78219736760145]
CacheQuant is a novel training-free paradigm that comprehensively accelerates diffusion models by jointly optimizing model caching and quantization techniques. Experimental results show that CacheQuant achieves a 5.18x speedup and 4x compression for Stable Diffusion on MS-COCO, with only a 0.02 loss in CLIP score.
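For context, here is a minimal structural sketch of what combining the two mechanisms looks like: a block that runs with fake-quantized weights and reuses its cached output on selected timesteps. The class, bit-width, and cache schedule are illustrative assumptions, not CacheQuant's joint optimization.

```python
import torch
import torch.nn as nn

def fake_quant(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric uniform fake-quantization (round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

class CachedQuantBlock(nn.Module):
    """Toy block that (a) runs with fake-quantized weights and (b) reuses its
    previous output on timesteps where caching is enabled."""
    def __init__(self, dim: int = 64, bits: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.bits = bits
        self._cache = None

    def forward(self, x: torch.Tensor, reuse_cache: bool) -> torch.Tensor:
        if reuse_cache and self._cache is not None:
            return self._cache                       # skip the computation entirely
        w_q = fake_quant(self.proj.weight, self.bits)
        out = nn.functional.linear(x, w_q, self.proj.bias)
        self._cache = out
        return out

block = CachedQuantBlock()
x = torch.randn(2, 64)
for step in range(10):
    y = block(x, reuse_cache=(step % 2 == 1))        # reuse the cache on odd steps
```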
arXiv Detail & Related papers (2025-03-03T09:04:51Z)
- TQ-DiT: Efficient Time-Aware Quantization for Diffusion Transformers [3.389132862174821]
We introduce model quantization, which represents the weights and activation values with lower precision. Time-grouping quantization (TGQ) is proposed to reduce quantization error caused by temporal variation in activations. The proposed algorithm achieves performance comparable to the original full-precision model with only a 0.29 increase in FID at W8A8.
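A minimal sketch of time-grouping for activation quantization, assuming a fixed group size and symmetric per-group scales; TQ-DiT's actual grouping and calibration procedure may differ.

```python
import torch

def build_group_scales(calib_acts, group_size: int, bits: int = 8):
    """Compute one symmetric activation scale per timestep group.

    calib_acts maps timestep -> tensor of calibration activations at that step.
    """
    qmax = 2 ** (bits - 1) - 1
    peaks = {}
    for t, acts in calib_acts.items():
        g = t // group_size
        peaks[g] = max(peaks.get(g, 0.0), acts.abs().max().item())
    return {g: peak / qmax for g, peak in peaks.items()}

def quant_act(x: torch.Tensor, t: int, scales, group_size: int, bits: int = 8):
    """Quantize an activation with the scale of its timestep group."""
    qmax = 2 ** (bits - 1) - 1
    s = scales[t // group_size]
    return torch.round(x / s).clamp(-qmax, qmax) * s

# toy calibration: activation magnitude varies strongly with the diffusion timestep
calib = {t: torch.randn(256) * (0.1 + 0.02 * t) for t in range(50)}
scales = build_group_scales(calib, group_size=10)
x_q = quant_act(torch.randn(256) * 0.5, t=27, scales=scales, group_size=10)
```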
arXiv Detail & Related papers (2025-02-06T13:14:52Z)
- MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models [37.061975191553]
This paper presents MPQ-DM, a Mixed-Precision Quantization method for Diffusion Models. To mitigate the quantization error caused by weight channels with severe outliers, we propose an Outlier-Driven Mixed Quantization technique. To robustly learn representations across time steps, we construct a Time-Smoothed Relation Distillation scheme.
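A hedged sketch of outlier-driven bit allocation, using per-channel kurtosis as a stand-in outlier metric; the metric, bit-widths, and fraction below are assumptions, not MPQ-DM's exact rule.

```python
import torch

def allocate_bits_by_outliers(weight: torch.Tensor, low: int = 2, high: int = 4,
                              frac_high: float = 0.1) -> torch.Tensor:
    """Give the most outlier-heavy output channels a higher bit-width.

    Outlier severity is measured with per-channel kurtosis (heavy tails give
    large values); the real criterion in MPQ-DM may differ.
    """
    w = weight.flatten(1)                                    # [out_channels, -1]
    mean = w.mean(dim=1, keepdim=True)
    std = w.std(dim=1, keepdim=True) + 1e-8
    kurtosis = (((w - mean) / std) ** 4).mean(dim=1)
    k = max(1, int(frac_high * w.shape[0]))
    bits = torch.full((w.shape[0],), low)
    bits[kurtosis.topk(k).indices] = high
    return bits

weight = torch.randn(128, 512)
weight[::16] *= 8                                            # inject outlier channels
print(allocate_bits_by_outliers(weight).bincount())
```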
arXiv Detail & Related papers (2024-12-16T08:31:55Z)
- Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints [51.83081671798784]
Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability.
DiT's practical application suffers from inherent dynamic feature instability, leading to error amplification during cached inference.
We propose Skip-DiT, a novel DiT variant enhanced with Long-Skip-Connections (LSCs) - the key efficiency component in U-Nets.
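A structural sketch of long skip connections in a transformer-style stack, in the spirit of U-Net skips; the block design and fusion layers are placeholders rather than Skip-DiT's architecture or its spectral constraints.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
    def forward(self, x):
        return x + self.net(x)

class SkipStack(nn.Module):
    """Stack with U-Net-like long skips: the output of block i is fused into
    the input of block depth-1-i via a small linear layer."""
    def __init__(self, dim=64, depth=8):
        super().__init__()
        assert depth % 2 == 0
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
        self.fuse = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(depth // 2))

    def forward(self, x):
        skips, half = [], len(self.blocks) // 2
        for blk in self.blocks[:half]:                 # first half: stash activations
            x = blk(x)
            skips.append(x)
        for i, blk in enumerate(self.blocks[half:]):   # second half: fuse long skips
            x = self.fuse[i](torch.cat([x, skips[-(i + 1)]], dim=-1))
            x = blk(x)
        return x

y = SkipStack()(torch.randn(2, 16, 64))
```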
arXiv Detail & Related papers (2024-11-26T17:28:10Z)
- Scalable and Effective Negative Sample Generation for Hyperedge Prediction [55.9298019975967]
Hyperedge prediction is crucial for understanding complex multi-entity interactions in web-based applications.
Traditional methods often face difficulties in generating high-quality negative samples due to the imbalance between positive and negative instances.
We present the scalable and effective negative sample generation for Hyperedge Prediction (SEHP) framework, which utilizes diffusion models to tackle these challenges.
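To make the notion of a negative hyperedge concrete, here is the classic perturbation baseline (swap member nodes for random non-members) that methods like SEHP improve on with generative models; it is not SEHP's diffusion-based sampler.

```python
import random

def perturb_negative(hyperedge, num_nodes: int, replace_k: int = 1):
    """Build a negative hyperedge by replacing k member nodes with random
    nodes that are not part of the original hyperedge."""
    neg = set(hyperedge)
    removed = random.sample(sorted(neg), k=replace_k)
    for node in removed:
        neg.remove(node)
        candidate = random.randrange(num_nodes)
        while candidate in neg or candidate in hyperedge:
            candidate = random.randrange(num_nodes)
        neg.add(candidate)
    return neg

positives = [{0, 3, 7}, {2, 5, 9, 11}]
negatives = [perturb_negative(e, num_nodes=50) for e in positives]
```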
arXiv Detail & Related papers (2024-11-19T09:16:25Z)
- 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment.
It is notorious that low-bit quantization degrades the accuracy of SR models compared to their full-precision (FP) counterparts.
We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
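A generic sketch of bound calibration for low-bit PTQ: grid-search a symmetric clipping bound that minimizes quantization MSE instead of using the raw min/max. 2DQuant's dual-stage procedure is more involved; this only conveys why tuned bounds help in the presence of outliers.

```python
import torch

def search_clip_bound(x: torch.Tensor, bits: int = 4, steps: int = 80) -> float:
    """Grid-search a symmetric clipping bound minimizing quantization MSE."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = x.abs().max().item()
    best_bound, best_err = max_abs, float("inf")
    for i in range(1, steps + 1):
        bound = max_abs * i / steps
        scale = bound / qmax
        x_q = torch.round(x.clamp(-bound, bound) / scale).clamp(-qmax, qmax) * scale
        err = torch.mean((x - x_q) ** 2).item()
        if err < best_err:
            best_bound, best_err = bound, err
    return best_bound

acts = torch.cat([torch.randn(10000), 20 * torch.randn(10)])   # a few large outliers
print(search_clip_bound(acts))                                  # chosen bound < abs max
```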
arXiv Detail & Related papers (2024-06-10T06:06:11Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer can, through a caching mechanism, be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
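A structural sketch of learned layer caching: one learnable logit per layer decides whether to recompute or reuse a cached activation. The gate, threshold, and training signal here are assumptions, not L2C's actual router or objective.

```python
import torch
import torch.nn as nn

class LayerCacheRouter(nn.Module):
    """Learnable per-layer 'compute vs. reuse cache' decision.

    One logit per layer would be trained (e.g. with the task loss plus a
    compute-cost penalty); at inference, layers whose gate falls below a
    threshold return their cached output from a previous timestep.
    """
    def __init__(self, num_layers: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))

    def should_compute(self, layer_idx: int, threshold: float = 0.5) -> bool:
        return torch.sigmoid(self.logits[layer_idx]).item() >= threshold

layers = nn.ModuleList(nn.Linear(64, 64) for _ in range(6))
router = LayerCacheRouter(len(layers))
cache = [None] * len(layers)

x = torch.randn(2, 64)
for i, layer in enumerate(layers):
    if router.should_compute(i) or cache[i] is None:
        cache[i] = layer(x)          # recompute and refresh the cache
    x = cache[i]                     # otherwise reuse the stored activation
```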
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting [70.83632337581034]
Diffusion-based image super-resolution (SR) methods are mainly limited by the low inference speed.
We propose a novel and efficient diffusion model for SR that significantly reduces the number of diffusion steps.
Our method constructs a Markov chain that transfers between the high-resolution image and the low-resolution image by shifting the residual.
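A minimal sketch of a residual-shifting forward step, assuming the marginal mean moves from the HR image toward the upsampled LR image as a shifting coefficient grows; the schedule and noise scale below are illustrative, not the paper's exact hyperparameters.

```python
import torch

def resshift_forward(x0: torch.Tensor, y0: torch.Tensor, eta_t: float, kappa: float = 2.0):
    """Sample x_t from a residual-shifting forward process:
        x_t = x0 + eta_t * (y0 - x0) + kappa * sqrt(eta_t) * eps
    where x0 is the HR image and y0 the upsampled LR image."""
    eps = torch.randn_like(x0)
    return x0 + eta_t * (y0 - x0) + kappa * (eta_t ** 0.5) * eps

hr = torch.rand(1, 3, 64, 64)                          # high-resolution target
lr_up = torch.nn.functional.interpolate(               # bicubic-upsampled LR input
    torch.rand(1, 3, 16, 16), size=64, mode="bicubic", align_corners=False)
x_mid = resshift_forward(hr, lr_up, eta_t=0.5)
```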
arXiv Detail & Related papers (2023-07-23T15:10:02Z)
- Q-Diffusion: Quantizing Diffusion Models [52.978047249670276]
Post-training quantization (PTQ) is considered a go-to compression method for other tasks, but it does not directly fit the multi-timestep pipeline of diffusion models.
We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture.
We show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance.
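A plain round-to-nearest, per-channel 4-bit weight fake-quantization pass is sketched below to show what W4 PTQ amounts to mechanically; Q-Diffusion's contribution lies in its timestep-aware calibration and shortcut handling, which are not reproduced here.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fake_quant_weights_(model: nn.Module, bits: int = 4) -> None:
    """Replace Linear/Conv2d weights in-place with per-output-channel
    symmetric fake-quantized values (plain round-to-nearest PTQ)."""
    qmax = 2 ** (bits - 1) - 1
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            w = m.weight.data
            scale = w.flatten(1).abs().amax(dim=1).clamp_min(1e-8) / qmax
            shape = (-1,) + (1,) * (w.ndim - 1)
            w_q = torch.round(w / scale.view(shape)).clamp(-qmax, qmax) * scale.view(shape)
            m.weight.data.copy_(w_q)

net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 3, 3, padding=1))
fake_quant_weights_(net, bits=4)
out = net(torch.randn(1, 3, 32, 32))
```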
arXiv Detail & Related papers (2023-02-08T19:38:59Z)
- Accelerating Score-based Generative Models with Preconditioned Diffusion Sampling [36.02321871608158]
We propose a model-agnostic preconditioned diffusion sampling (PDS) method that leverages matrix preconditioning to alleviate the slow sampling of score-based generative models (SGMs).
PDS consistently accelerates off-the-shelf SGMs whilst maintaining the synthesis quality.
In particular, PDS can accelerate sampling by up to 29x on the more challenging high-resolution (1024x1024) image generation task.
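A toy instance of preconditioned sampling: a diagonal preconditioner applied to both the drift and the noise of a Langevin update, which leaves the target distribution approximately invariant for small steps. PDS builds its preconditioner from image frequency statistics and plugs it into SGM samplers; the diagonal matrix and step size here are placeholders.

```python
import torch

def preconditioned_langevin_step(x, score_fn, precond_diag, step_size=1e-2):
    """One Langevin step with a diagonal preconditioner M:
        x <- x + (eps/2) * M * score(x) + sqrt(eps) * sqrt(M) * z
    """
    z = torch.randn_like(x)
    return (x + 0.5 * step_size * precond_diag * score_fn(x)
            + (step_size ** 0.5) * precond_diag.sqrt() * z)

# toy example: sample from N(0, I), where score(x) = -x, with an uneven preconditioner
x = torch.randn(1000, 2)
M = torch.tensor([4.0, 0.25])                 # boost one coordinate, damp the other
for _ in range(200):
    x = preconditioned_langevin_step(x, lambda v: -v, M)
print(x.std(dim=0))                           # both coordinates stay near std 1
```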
arXiv Detail & Related papers (2022-07-05T17:55:42Z)