Speedrunning ImageNet Diffusion
- URL: http://arxiv.org/abs/2512.12386v1
- Date: Sat, 13 Dec 2025 16:30:28 GMT
- Title: Speedrunning ImageNet Diffusion
- Authors: Swayam Bhanded
- Abstract summary: SR-DiT (Speedrun Diffusion Transformer) is a framework that integrates token routing, architectural improvements, and training modifications on top of representation alignment. Our approach achieves FID 3.49 and KDD 0.319 on ImageNet-256 using only a 140M parameter model at 400K iterations without classifier-free guidance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances have significantly improved the training efficiency of diffusion transformers. However, these techniques have largely been studied in isolation, leaving unexplored the potential synergies from combining multiple approaches. We present SR-DiT (Speedrun Diffusion Transformer), a framework that systematically integrates token routing, architectural improvements, and training modifications on top of representation alignment. Our approach achieves FID 3.49 and KDD 0.319 on ImageNet-256 using only a 140M parameter model at 400K iterations without classifier-free guidance - comparable to results from 685M parameter models trained significantly longer. To our knowledge, this is a state-of-the-art result at this model size. Through extensive ablation studies, we identify which technique combinations are most effective and document both synergies and incompatibilities. We release our framework as a computationally accessible baseline for future research.
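The abstract names the ingredients but not how they fit together. As a rough, illustrative sketch of how token routing and a representation-alignment loss might share one training step (the `dit` call signature, the frozen DINOv2-style target encoder, the routing ratio, and the loss weight `lam` are all assumptions, not details from the paper):

```python
# Illustrative sketch only -- not SR-DiT's actual code. It shows one way a
# token-routing mask and a REPA-style alignment loss could be combined in a
# single training step. All module names and hyperparameters are assumed.
import torch
import torch.nn.functional as F

def train_step(dit, dino, proj, x_latent, cond, keep_ratio=0.5, lam=0.5):
    B, N, D = x_latent.shape                   # batch, tokens, dim
    t = torch.rand(B, 1, 1)                    # random diffusion times
    noise = torch.randn_like(x_latent)
    x_t = (1 - t) * x_latent + t * noise       # linear flow-matching path

    # Token routing: only a random subset of tokens is processed by the
    # expensive middle blocks (hypothetical `routed_tokens` argument).
    keep = torch.rand(B, N).argsort(dim=1)[:, : int(N * keep_ratio)]
    v_pred, hidden = dit(x_t, t, cond, routed_tokens=keep)

    # Flow-matching target: velocity pointing from data toward noise.
    loss_diff = F.mse_loss(v_pred, noise - x_latent)

    # REPA-style alignment: cosine-align an intermediate feature map with a
    # frozen self-supervised encoder's features (DINOv2 assumed here).
    with torch.no_grad():
        target = dino(x_latent)
    loss_align = 1 - F.cosine_similarity(proj(hidden), target, dim=-1).mean()

    return loss_diff + lam * loss_align
```

The structure is the point: one diffusion loss, one auxiliary alignment term, and a routing mask that shrinks the token count seen by most blocks.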
Related papers
- Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model [53.77953728335891]
Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network. We propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion.
arXiv Detail & Related papers (2025-11-18T17:58:16Z)
- E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources [12.244453688491731]
Efficient Multimodal Diffusion Transformer (E-MMDiT) is an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis. Our model for 512px generation, trained with only 25M public data samples in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches 0.72 with some post-training techniques such as GRPO.
arXiv Detail & Related papers (2025-10-31T03:13:08Z)
- ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion [7.233066974580282]
Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. We propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training.
arXiv Detail & Related papers (2025-10-29T17:17:32Z)
- Improving Progressive Generation with Decomposable Flow Matching [50.63174319509629]
Decomposable Flow Matching (DFM) is a simple and effective framework for the progressive generation of visual media. On ImageNet-1k 512px, DFM achieves a 35.2% improvement in FDD score over the base architecture and 26.4% over the best-performing baseline.
arXiv Detail & Related papers (2025-06-24T17:58:02Z)
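The DFM entry above describes progressive generation as flow matching on a decomposed representation. A minimal sketch of that general idea, assuming a two-level Laplacian-style split and equal level weighting (the paper's actual decomposition and weighting are not given here):

```python
# Rough illustration of flow matching on a decomposition -- not DFM's code.
import torch
import torch.nn.functional as F

def decompose(x):
    """Split a batch (B, C, H, W) into a coarse level and a detail residual."""
    coarse = F.avg_pool2d(x, 2)
    detail = x - F.interpolate(coarse, scale_factor=2, mode="bilinear")
    return coarse, detail

def fm_loss(model, x0, t):
    """Standard flow-matching loss on one level: predict the velocity x1 - x0."""
    x1 = torch.randn_like(x0)                  # noise endpoint
    xt = (1 - t) * x0 + t * x1                 # linear interpolation path
    return F.mse_loss(model(xt, t), x1 - x0)

def dfm_style_loss(coarse_model, detail_model, x):
    t = torch.rand(x.shape[0], 1, 1, 1)
    coarse, detail = decompose(x)
    # Each level is matched independently; equal weighting is an assumption.
    return fm_loss(coarse_model, coarse, t) + fm_loss(detail_model, detail, t)
```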
- TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training [25.744324109042385]
Diffusion models typically suffer from sample inefficiency and high training costs. We show that TREAD reduces computational cost and simultaneously boosts model performance. We achieve a competitive FID of 2.09 in a guided setting and 3.93 in an unguided setting.
arXiv Detail & Related papers (2025-01-08T18:38:25Z)
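Token routing of the kind TREAD describes can be pictured as a random subset of tokens taking a shortcut past a span of transformer blocks and being scattered back afterwards, so the skipped blocks process fewer tokens. This is one plausible reading of the abstract, not the paper's implementation; the gather/scatter scheme and the 50% keep rate are assumptions:

```python
# Illustrative token routing in the spirit of TREAD -- not the paper's code.
import torch

def routed_forward(early, middle, late, tokens, keep_ratio=0.5):
    B, N, D = tokens.shape
    x = early(tokens)                            # all tokens see early blocks

    n_keep = int(N * keep_ratio)
    order = torch.rand(B, N, device=x.device).argsort(dim=1)
    idx = order[:, :n_keep].unsqueeze(-1).expand(-1, -1, D)

    processed = middle(torch.gather(x, 1, idx))  # middle blocks see a subset
    x = x.scatter(1, idx, processed)             # bypassed tokens re-enter

    return late(x)                               # all tokens see late blocks
```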
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models [33.519892081718716]
We propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE significantly expands the reconstruction-generation frontier of latent diffusion models. We build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT.
arXiv Detail & Related papers (2025-01-02T18:59:40Z)
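The alignment proposed in the entry above can be sketched as an auxiliary loss during tokenizer training that pulls the VAE's latents toward features of a frozen vision foundation model. The projection head and the cosine form below are illustrative assumptions, not VA-VAE's exact objective:

```python
# Assumed reading of vision-foundation-model latent alignment -- not VA-VAE's code.
import torch
import torch.nn.functional as F

def vfm_alignment_loss(vae_encoder, foundation_model, proj, images):
    z = vae_encoder(images)                     # (B, C, h, w) latents
    with torch.no_grad():                       # frozen target features,
        feats = foundation_model(images)        # assumed shape (B, h*w, D)
    z_flat = z.flatten(2).transpose(1, 2)       # (B, h*w, C)
    sim = F.cosine_similarity(proj(z_flat), feats, dim=-1)
    return (1 - sim).mean()
```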
- LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers [79.07412045476872]
Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks. We show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps. We propose a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations.
arXiv Detail & Related papers (2024-12-17T01:12:35Z)
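The lazy reuse described above amounts to caching across denoising steps: if a block's input has barely changed since the previous step, return the previous output instead of recomputing. LazyDiT learns when to skip; the fixed cosine-similarity threshold below is a stand-in heuristic, not the paper's learned criterion:

```python
# Heuristic sketch of lazy reuse across diffusion steps -- LazyDiT instead
# *learns* the skip decision; a fixed similarity threshold stands in here.
import torch
import torch.nn.functional as F

class LazyBlock(torch.nn.Module):
    def __init__(self, block, threshold=0.95):
        super().__init__()
        self.block, self.threshold = block, threshold
        self.prev_in = self.prev_out = None

    def forward(self, x):
        if self.prev_in is not None:
            sim = F.cosine_similarity(
                x.flatten(1), self.prev_in.flatten(1), dim=1).mean()
            if sim > self.threshold:            # input barely changed:
                return self.prev_out            # reuse the cached output
        out = self.block(x)
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out
```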
- TinyFusion: Diffusion Transformers Learned Shallow [52.96232442322824]
Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization. We present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2$\times$ speedup with an FID score of 2.86.
arXiv Detail & Related papers (2024-12-02T07:05:39Z)
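End-to-end learned depth pruning, as in the TinyFusion entry above, can be approximated with a differentiable keep/drop gate per layer trained via a straight-through estimator. This is a generic sketch of that mechanism; TinyFusion's actual recipe (including its recoverability modeling) differs:

```python
# Generic learnable depth-pruning sketch -- not TinyFusion's actual method.
import torch

class GatedDepth(torch.nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)   # residual branches assumed
        self.logits = torch.nn.Parameter(torch.zeros(len(layers)))

    def forward(self, x):
        probs = torch.sigmoid(self.logits)
        for layer, p in zip(self.layers, probs):
            hard = (p > 0.5).float()                # binary keep/drop decision
            gate = hard + p - p.detach()            # straight-through gradient
            x = x + gate * layer(x)
        return x
```

After training, layers whose gates settle near zero can be removed outright, yielding a shallower network.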
- Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis [82.72941975704374]
Non-autoregressive Transformers (NATs) have been recognized for their rapid generation.
We re-evaluate the full potential of NATs by revisiting the design of their training and inference strategies.
We propose to go beyond existing methods by directly solving for the optimal strategies in an automatic framework.
arXiv Detail & Related papers (2024-06-08T13:52:20Z)
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Third, we propose global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- Analyzing and Improving the Training Dynamics of Diffusion Models [36.37845647984578]
We identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture.
We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity.
arXiv Detail & Related papers (2023-12-05T11:55:47Z)
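One family of fixes for the drifts and imbalances mentioned above is to prevent weight magnitudes from growing at all, for example by renormalizing weights on every forward pass. The layer below is a simplified sketch of that idea, not the paper's exact per-channel scheme:

```python
# Simplified magnitude-preserving linear layer -- the paper's precise
# normalization and scaling scheme is more involved than this sketch.
import torch

class MPLinear(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(d_out, d_in))

    def forward(self, x):
        # Renormalize each output row to unit norm so weight magnitudes
        # cannot drift; unit-variance inputs then keep outputs near unit
        # variance as well.
        w = self.weight / self.weight.norm(dim=1, keepdim=True)
        return torch.nn.functional.linear(x, w)
```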
- DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT).
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z)
- Searching a Compact Architecture for Robust Multi-Exposure Image Fusion [55.37210629454589]
Two major stumbling blocks hinder development: pixel misalignment and inefficient inference.
This study introduces an architecture search-based paradigm incorporating self-alignment and detail repletion modules for robust multi-exposure image fusion.
The proposed method outperforms various competitive schemes, achieving a noteworthy 3.19% improvement in PSNR for general scenarios and an impressive 23.5% enhancement in misaligned scenarios.
arXiv Detail & Related papers (2023-05-20T17:01:52Z)
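For reference, the PSNR figures quoted in the entry above use the standard definition; a minimal helper, assuming float images scaled to [0, 1]:

```python
# Standard peak signal-to-noise ratio (in dB) for images in [0, 1].
import torch

def psnr(pred, target, max_val=1.0):
    mse = torch.mean((pred - target) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)
```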