DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling
- URL: http://arxiv.org/abs/2505.11196v1
- Date: Fri, 16 May 2025 12:54:04 GMT
- Title: DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling
- Authors: Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang
- Abstract summary: Diffusion Transformer (DiT) is a promising diffusion model for visual generation but incurs significant computational overhead. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. We introduce Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules.
- Score: 53.33281984430122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns, highlighting the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet benchmarks, DiCo outperforms previous diffusion models in both image quality and generation speed. Notably, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53 at 512x512, with 2.7x and 3.1x speedups over DiT-XL/2, respectively. Furthermore, our largest model, DiCo-H, scaled to 1B parameters, reaches an FID of 1.90 on ImageNet 256x256, without any additional supervision during training. Code: https://github.com/shallowdream204/DiCo.
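The core architectural idea, an attention-free ConvNet block whose global interaction comes from a compact channel-attention gate, can be sketched as follows. This is a hypothetical PyTorch illustration under assumed shapes; the module names (`CompactChannelAttention`, `ConvDiffusionBlock`) and the SE-style gate are one plausible reading of the abstract, not the authors' implementation (see the official repository linked above for the actual code).

```python
# Hypothetical sketch, not the authors' code: a ConvNet diffusion block whose
# self-attention is replaced by depthwise convolution plus a compact
# channel-attention gate, in the spirit of the DiCo abstract.
import torch
import torch.nn as nn

class CompactChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gate; an assumed stand-in for the
    'compact channel attention' described in the abstract."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # global context per channel
        self.proj = nn.Conv2d(channels, channels, 1)  # pointwise re-weighting
        self.gate = nn.Sigmoid()

    def forward(self, x):
        return x * self.gate(self.proj(self.pool(x)))  # re-scale channels

class ConvDiffusionBlock(nn.Module):
    """Attention-free block: depthwise conv for spatial mixing, pointwise
    convs for channel mixing, channel attention to encourage diverse channels."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.norm = nn.GroupNorm(32, channels)
        self.pw_in = nn.Conv2d(channels, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.attn = CompactChannelAttention(hidden)
        self.pw_out = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        h = self.pw_in(self.norm(x))
        h = self.act(self.dw(h))
        h = self.attn(h)           # promote diverse channel activations
        return x + self.pw_out(h)  # residual connection

x = torch.randn(2, 256, 32, 32)          # latent feature map
print(ConvDiffusionBlock(256)(x).shape)  # torch.Size([2, 256, 32, 32])
```

The point of the sketch is only that channel re-weighting, rather than spatial self-attention, supplies the global interaction; all other pieces are standard ConvNet modules.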
Related papers
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models [33.519892081718716]
We propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE significantly expands the reconstruction-generation frontier of latent diffusion models. We build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT.
arXiv Detail & Related papers (2025-01-02T18:59:40Z)
- Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive (VAR) framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
arXiv Detail & Related papers (2024-11-26T15:13:15Z)
- Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73x, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z)
- Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization [26.926712014346432]
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, setting new state-of-the-art FID scores of 1.70 on ImageNet 256x256 and 2.89 on ImageNet 512x512.
arXiv Detail & Related papers (2024-06-13T17:59:58Z)
- DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention [82.24166963631949]
Diffusion Gated Linear Attention Transformers (DiG) is a simple, adoptable solution with minimal parameter overhead. We offer two variants, i.e., a plain and a U-shaped architecture, showing superior efficiency and competitive effectiveness.
arXiv Detail & Related papers (2024-05-28T17:59:33Z)
- TerDiT: Ternary Diffusion Models with Transformers [88.03738506648291]
TerDiT is the first quantization-aware training scheme for low-bit diffusion transformer models. We focus on the ternarization of DiT networks, with model sizes ranging from 600M to 4.2B and image resolution from 256x256 to 512x512.
arXiv Detail & Related papers (2024-05-23T17:57:24Z)
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first identify the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Thirdly, we propose global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- KOALA: Empirical Lessons Toward Memory-Efficient and Fast Diffusion Models for Text-to-Image Synthesis [52.42320594388199]
We present three key practices in building an efficient text-to-image model.
Based on these findings, we build two types of efficient text-to-image models, called KOALA-Turbo and KOALA-Lightning.
Unlike SDXL, our KOALA models can generate 1024px high-resolution images on consumer-grade GPUs with 8GB of VRAM (3060 Ti).
arXiv Detail & Related papers (2023-12-07T02:46:18Z)
- Scalable Diffusion Models with Transformers [18.903245758902834]
We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches.
We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID.
arXiv Detail & Related papers (2022-12-19T18:59:58Z)
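For context on the DiT baseline that DiCo is compared against, the last entry's idea of "a transformer that operates on latent patches" can be illustrated with a minimal sketch. The latent shape, patch size, and token width below are assumptions for illustration, and a plain `nn.TransformerEncoder` stands in for the actual DiT blocks; this is not the reference implementation.

```python
# Illustrative sketch (assumed shapes, not the DiT reference code): patchify a
# latent feature map into tokens so a standard Transformer can operate on it.
import torch
import torch.nn as nn

latent = torch.randn(1, 4, 32, 32)  # e.g. a 32x32x4 VAE latent of a 256x256 image
patch, dim = 2, 512                 # patch size and token width are assumptions

# A non-overlapping conv acts as the linear patch embedding:
# (1, 4, 32, 32) -> (1, 512, 16, 16) -> (1, 256, 512) tokens.
embed = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
tokens = embed(latent).flatten(2).transpose(1, 2)

# A generic Transformer encoder stands in for the DiT blocks here.
blocks = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
print(blocks(tokens).shape)  # torch.Size([1, 256, 512])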
This list is automatically generated from the titles and abstracts of the papers on this site.