Home-made Diffusion Model from Scratch to Hatch
- URL: http://arxiv.org/abs/2509.06068v1
- Date: Sun, 07 Sep 2025 14:21:57 GMT
- Title: Home-made Diffusion Model from Scratch to Hatch
- Authors: Shih-Ying Yeh
- Abstract summary: Home-made Diffusion Model (HDM) is an efficient yet powerful text-to-image diffusion model optimized for training on consumer-grade hardware. HDM achieves competitive 1024x1024 generation quality while maintaining a remarkably low training cost of $535-620.
- Score: 0.9383683724544296
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce Home-made Diffusion Model (HDM), an efficient yet powerful text-to-image diffusion model optimized for training (and inference) on consumer-grade hardware. HDM achieves competitive 1024x1024 generation quality while maintaining a remarkably low training cost of $535-620 using four RTX 5090 GPUs, representing a significant reduction in computational requirements compared to traditional approaches. Our key contributions include: (1) Cross-U-Transformer (XUT), a novel U-shaped transformer that employs cross-attention for skip connections, providing superior feature integration and remarkable compositional consistency; (2) a comprehensive training recipe that incorporates TREAD acceleration, a novel shifted square crop strategy for efficient arbitrary aspect-ratio training, and progressive resolution scaling; and (3) an empirical demonstration that smaller models (343M parameters) with carefully crafted architectures can achieve high-quality results and emergent capabilities, such as intuitive camera control. Our work provides an alternative scaling paradigm, demonstrating a viable path toward democratizing high-quality text-to-image generation for individual researchers and smaller organizations with limited computational resources.
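The central architectural idea in the abstract is replacing the usual concatenation-based skip connections of a U-shaped network with cross-attention, so that decoder tokens query the features of the matching encoder stage. The sketch below is a minimal PyTorch illustration of that idea only; the module name, dimensions, and layout are assumptions for illustration, not the authors' HDM/XUT implementation.

```python
# Minimal sketch of a cross-attention skip connection, the idea the abstract
# attributes to the Cross-U-Transformer (XUT). All names and sizes here are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class CrossAttentionSkip(nn.Module):
    """Decoder-side block that attends to encoder features instead of concatenating them."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Queries come from the decoder stream; keys/values come from the
        # matching encoder stage (the "skip" path).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, decoder_tokens: torch.Tensor, skip_tokens: torch.Tensor) -> torch.Tensor:
        q = self.norm_q(decoder_tokens)
        kv = self.norm_kv(skip_tokens)
        fused, _ = self.attn(q, kv, kv, need_weights=False)
        # Residual connection keeps the decoder stream intact.
        return decoder_tokens + fused


if __name__ == "__main__":
    B, N, D = 2, 256, 384  # batch, tokens, width (arbitrary toy sizes)
    dec = torch.randn(B, N, D)
    enc = torch.randn(B, N, D)
    out = CrossAttentionSkip(D)(dec, enc)
    print(out.shape)  # torch.Size([2, 256, 384])
```

Compared with concatenation, this fusion lets every decoder token weigh all encoder positions, which is one plausible reading of the "superior feature integration" claim; the real model's block ordering, normalization, and conditioning are not specified here.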
Related papers
- DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing [67.77471070868852]
DeepGen 1.0 is a lightweight 5B unified model for image generation and editing. It is trained on only 50M samples, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
arXiv Detail & Related papers (2026-02-12T17:44:24Z) - MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at a sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z) - SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices [72.0937240883345]
Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment. We present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints.
arXiv Detail & Related papers (2026-01-13T07:46:46Z) - Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer [38.99742258165009]
Z-Image is an efficient foundation generative model that challenges the "scale-at-all-costs" paradigm. Our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. We release our code, weights, and online demo to foster the development of budget-friendly, yet state-of-the-art generative models.
arXiv Detail & Related papers (2025-11-27T18:52:07Z) - E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources [12.244453688491731]
Efficient Multimodal Diffusion Transformer (E-MMDiT) is a lightweight multimodal diffusion model with only 304M parameters for fast image synthesis. Our model for 512px generation, trained on only 25M public samples in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches 0.72 with post-training techniques such as GRPO.
arXiv Detail & Related papers (2025-10-31T03:13:08Z) - Boosting Generative Image Modeling via Joint Image-Feature Synthesis [15.133906625258797]
We introduce a novel generative image modeling framework that bridges the gap between generation and representation learning by using a diffusion model to jointly model low-level image latents and high-level semantic features. Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance.
arXiv Detail & Related papers (2025-04-22T17:41:42Z) - FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction [91.09318592542509]
This work challenges the residual prediction paradigm in visual autoregressive modeling. It presents a new Flexible Visual AutoRegressive image generation paradigm. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable.
arXiv Detail & Related papers (2025-02-27T17:39:17Z) - LinFusion: 1 GPU, 1 Minute, 16K Image [71.44735417472043]
We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD.
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
arXiv Detail & Related papers (2024-09-03T17:54:39Z) - DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis [56.849285913695184]
Diffusion Mamba (DiM) is a sequence model for efficient high-resolution image synthesis.
DiM architecture achieves inference-time efficiency for high-resolution images.
Experiments demonstrate the effectiveness and efficiency of our DiM.
arXiv Detail & Related papers (2024-05-23T06:53:18Z) - SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions [5.100085108873068]
We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a single GPU.
Our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation.
arXiv Detail & Related papers (2024-03-25T11:16:23Z) - Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation [112.08287900261898]
This paper proposes a novel self-cascade diffusion model for rapid adaptation to higher-resolution image and video generation.
Our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters.
Experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
arXiv Detail & Related papers (2024-02-16T07:48:35Z) - Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day [63.96075838322437]
We propose a framework to learn a single-image novel-view synthesizer.
Our framework is able to reduce the total training time from 10 days to less than 1 day.
arXiv Detail & Related papers (2023-10-04T17:57:07Z) - Dual-former: Hybrid Self-attention Transformer for Efficient Image Restoration [6.611849560359801]
We present Dual-former, which combines the powerful global modeling ability of self-attention modules and the local modeling ability of convolutions in an overall architecture.
Experiments demonstrate that Dual-former achieves a 1.91dB gain over the state-of-the-art MAXIM method on the Indoor dataset for single image dehazing.
For single image deraining, it exceeds the SOTA method by 0.1dB PSNR averaged over five datasets while using only 21.5% of the GFLOPs.
arXiv Detail & Related papers (2022-10-03T16:39:21Z) - Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis [21.40315235087551]
We propose a lightweight GAN structure that achieves superior quality at 1024x1024 resolution.
We show our model's superior performance compared to the state-of-the-art StyleGAN2 when data and computing budgets are limited.
arXiv Detail & Related papers (2021-01-12T22:02:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.