HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration
- URL: http://arxiv.org/abs/2410.01723v3
- Date: Fri, 31 Jan 2025 14:26:05 GMT
- Title: HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration
- Authors: Yushi Huang, Zining Wang, Ruihao Gong, Jing Liu, Xinjie Zhang, Jinyang Guo, Xianglong Liu, Jun Zhang,
- Abstract summary: We propose a novel learning-based caching framework dubbed HarmoniCa.
It incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process.
It also incorporates an Image Error Proxy-Guided Objective (IEPO) to balance image quality against cache utilization.
- Score: 31.982294870690925
- License:
- Abstract: Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations, offers the potential for acceleration. Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep. It also suffers from misaligned objectives--aligned predicted noise vs. high-quality images--between training and inference. These two discrepancies compromise both performance and efficiency. To this end, we harmonize training and inference with a novel learning-based caching framework dubbed HarmoniCa. It first incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process, where prior steps can be leveraged. In addition, an Image Error Proxy-Guided Objective (IEPO) is applied to balance image quality against cache utilization through an efficient proxy to approximate the image error. Extensive experiments across $8$ models, $4$ samplers, and resolutions from $256\times256$ to $2K$ demonstrate superior performance and speedup of our framework. For instance, it achieves over $40\%$ latency reduction (i.e., $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$. Remarkably, our image-free approach reduces training time by $25\%$ compared with the previous method.
Related papers
- FreCaS: Efficient Higher-Resolution Image Generation via Frequency-aware Cascaded Sampling [13.275724439963188]
FreCaS decomposes the sampling process into cascaded stages with gradually increased resolutions and expanding frequency bands.
FreCaS significantly outperforms state-of-the-art methods in image quality and generation speed.
arXiv Detail & Related papers (2024-10-24T03:56:44Z) - Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget [53.311109531586844]
We demonstrate very low-cost training of large-scale T2I diffusion transformer models.
We train a 1.16 billion parameter sparse transformer with only $1,890 economical cost and achieve a 12.7 FID in zero-shot generation.
We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.
arXiv Detail & Related papers (2024-07-22T17:23:28Z) - Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somehow surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learningto-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-r, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z) - Implicit Image-to-Image Schrodinger Bridge for Image Restoration [13.138398298354113]
The Image-to-Image Schr"odinger Bridge (I$2$SB) presents a promising alternative by starting the generative process from corrupted images.
We introduce the Implicit Image-to-Image Schr"odinger Bridge (I$3$SB) to further accelerate the generative process of I$2$SB.
arXiv Detail & Related papers (2024-03-10T03:22:57Z) - DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
Under the same throughput, DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS.
arXiv Detail & Related papers (2023-12-01T17:01:06Z) - ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
We show that optimizing consistency training loss minimizes the Wasserstein distance between target and generated distributions.
By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on CIFAR10 and ImageNet 64$times$64 and LSUN Cat 256$times$256 datasets.
arXiv Detail & Related papers (2023-11-23T16:49:06Z) - Efficient Diffusion Training via Min-SNR Weighting Strategy [78.5801305960993]
We treat the diffusion training as a multi-task learning problem and introduce a simple yet effective approach referred to as Min-SNR-$gamma$.
Our results demonstrate a significant improvement in converging speed, 3.4$times$ faster than previous weighting strategies.
It is also more effective, achieving a new record FID score of 2.06 on the ImageNet $256times256$ benchmark using smaller architectures than that employed in previous state-of-the-art.
arXiv Detail & Related papers (2023-03-16T17:59:56Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re- parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.