SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
- URL: http://arxiv.org/abs/2509.21318v1
- Date: Thu, 25 Sep 2025 16:07:38 GMT
- Title: SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
- Authors: Hmrishav Bandyopadhyay, Rahim Entezari, Jim Scott, Reshinth Adithyan, Yi-Zhe Song, Varun Jampani
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present SD3.5-Flash, an efficient few-step distillation framework that brings high-quality image generation to accessible consumer devices. Our approach distills computationally prohibitive rectified flow models through a reformulated distribution matching objective tailored specifically for few-step generation. We introduce two key innovations: "timestep sharing" to reduce gradient noise and "split-timestep fine-tuning" to improve prompt alignment. Combined with comprehensive pipeline optimizations like text encoder restructuring and specialized quantization, our system enables both rapid generation and memory-efficient deployment across different hardware configurations. This democratizes access across the full spectrum of devices, from mobile phones to desktop computers. Through extensive evaluation including large-scale user studies, we demonstrate that SD3.5-Flash consistently outperforms existing few-step methods, making advanced generative AI truly accessible for practical deployment.
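The abstract's "timestep sharing" idea (reusing one sampled timestep for the paired teacher/student evaluations inside the distribution-matching objective, rather than sampling each independently) can be illustrated with a toy variance comparison. The sketch below is a minimal, hypothetical illustration, not the paper's implementation: `teacher_v` and `student_v` are 1-D stand-ins for the actual rectified-flow velocity networks, and the deliberate `0.1*t` mismatch plays the role of the student's error.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical stand-ins for the teacher and student velocity networks.
def teacher_v(x: float, t: float) -> float:
    return x * (1.0 - t) + 0.5 * t

def student_v(x: float, t: float) -> float:
    return x * (1.0 - t) + 0.4 * t  # slightly mismatched student

def dmd_signal(x: float, shared: bool) -> float:
    """One Monte Carlo sample of a distribution-matching gradient signal.

    With shared=True, the same timestep drives both evaluations
    ("timestep sharing"); otherwise each draws its own timestep,
    which adds noise unrelated to the teacher-student gap.
    """
    t1 = rng.random()
    t2 = t1 if shared else rng.random()
    return teacher_v(x, t1) - student_v(x, t2)

x, n = 1.0, 20_000
shared_samples = [dmd_signal(x, shared=True) for _ in range(n)]
indep_samples = [dmd_signal(x, shared=False) for _ in range(n)]

# Both estimators target the same expected signal, but sharing the
# timestep removes the variance from decorrelated timestep draws.
print(f"shared-t variance:      {statistics.pvariance(shared_samples):.5f}")
print(f"independent-t variance: {statistics.pvariance(indep_samples):.5f}")
```

In this toy setting the shared-timestep estimator has the same mean but far lower variance, which is the intuition behind using timestep sharing to reduce gradient noise in few-step distillation.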
Related papers
- SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices [72.0937240883345]
Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment. We present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints.
arXiv Detail & Related papers (2026-01-13T07:46:46Z) - MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception. High computational cost limits adoption on resource-constrained platforms. We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z) - SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment [76.60024640625478]
Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps. We propose a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our method maintains high-quality video generation while substantially reducing the number of inference steps.
arXiv Detail & Related papers (2025-08-08T07:26:34Z) - Taming Diffusion Transformer for Efficient Mobile Video Generation in Seconds [91.56929670753226]
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones. We propose a series of novel optimizations to significantly accelerate video generation and enable practical deployment on mobile platforms.
arXiv Detail & Related papers (2025-07-17T17:59:10Z) - Improving Progressive Generation with Decomposable Flow Matching [50.63174319509629]
Decomposable Flow Matching (DFM) is a simple and effective framework for the progressive generation of visual media. On ImageNet-1k 512px, DFM achieves 35.2% improvements in FDD scores over the base architecture and 26.4% over the best-performing baseline.
arXiv Detail & Related papers (2025-06-24T17:58:02Z) - On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices [3.034710104407876]
We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device.
arXiv Detail & Related papers (2025-03-31T07:19:09Z) - Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation [49.202383675543466]
We present Acc3D to tackle the challenge of accelerating the diffusion process to generate 3D models from single images. To derive high-quality reconstructions through few-step inferences, we emphasize the critical issue of regularizing the learning of the score function in states of random noise.
arXiv Detail & Related papers (2025-03-20T09:18:10Z) - E2ED^2: Direct Mapping from Noise to Data for Enhanced Diffusion Models [15.270657838960114]
Diffusion models have established themselves as the de facto primary paradigm in visual generative modeling. We present a novel end-to-end learning paradigm that establishes direct optimization from the final generated samples to initial noises. Our method achieves substantial performance gains in terms of Fréchet Inception Distance (FID) and CLIP score, even with fewer sampling steps.
arXiv Detail & Related papers (2024-12-30T16:06:31Z) - FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin [32.172269679513285]
FlashOCC consolidates rapid and memory-efficient occupancy prediction.
Channel-to-height transformation is introduced to lift the output logits from the BEV into the 3D space.
Results substantiate the superiority of our plug-and-play paradigm over previous state-of-the-art methods.
arXiv Detail & Related papers (2023-11-18T15:28:09Z) - Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach of independently mapping text and vision into a joint embedding space, a.k.a. dual encoders, is attractive because retrieval scales well.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.