SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
- URL: http://arxiv.org/abs/2509.21318v1
- Date: Thu, 25 Sep 2025 16:07:38 GMT
- Title: SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
- Authors: Hmrishav Bandyopadhyay, Rahim Entezari, Jim Scott, Reshinth Adithyan, Yi-Zhe Song, Varun Jampani
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present SD3.5-Flash, an efficient few-step distillation framework that brings high-quality image generation to accessible consumer devices. Our approach distills computationally prohibitive rectified flow models through a reformulated distribution matching objective tailored specifically for few-step generation. We introduce two key innovations: "timestep sharing" to reduce gradient noise and "split-timestep fine-tuning" to improve prompt alignment. Combined with comprehensive pipeline optimizations like text encoder restructuring and specialized quantization, our system enables both rapid generation and memory-efficient deployment across different hardware configurations. This democratizes access across the full spectrum of devices, from mobile phones to desktop computers. Through extensive evaluation including large-scale user studies, we demonstrate that SD3.5-Flash consistently outperforms existing few-step methods, making advanced generative AI truly accessible for practical deployment.
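The abstract's "timestep sharing" idea (reusing one sampled timestep for the paired teacher/student evaluations inside the distribution-matching objective, rather than sampling each independently) can be illustrated with a toy variance comparison. The sketch below is a minimal, hypothetical illustration, not the paper's implementation: `teacher_v` and `student_v` are 1-D stand-ins for the actual rectified-flow velocity networks, and the deliberate `0.1*t` mismatch plays the role of the student's error.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical stand-ins for the teacher and student velocity networks.
def teacher_v(x: float, t: float) -> float:
    return x * (1.0 - t) + 0.5 * t

def student_v(x: float, t: float) -> float:
    return x * (1.0 - t) + 0.4 * t  # slightly mismatched student

def dmd_signal(x: float, shared: bool) -> float:
    """One Monte Carlo sample of a distribution-matching gradient signal.

    With shared=True, the same timestep drives both evaluations
    ("timestep sharing"); otherwise each draws its own timestep,
    which adds noise unrelated to the teacher-student gap.
    """
    t1 = rng.random()
    t2 = t1 if shared else rng.random()
    return teacher_v(x, t1) - student_v(x, t2)

x, n = 1.0, 20_000
shared_samples = [dmd_signal(x, shared=True) for _ in range(n)]
indep_samples = [dmd_signal(x, shared=False) for _ in range(n)]

# Both estimators target the same expected signal, but sharing the
# timestep removes the variance from decorrelated timestep draws.
print(f"shared-t variance:      {statistics.pvariance(shared_samples):.5f}")
print(f"independent-t variance: {statistics.pvariance(indep_samples):.5f}")
```

In this toy setting the shared-timestep estimator has the same mean but far lower variance, which is the intuition behind using timestep sharing to reduce gradient noise in few-step distillation.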
Related papers
- SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices [72.0937240883345]
Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment. We present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints.
arXiv Detail & Related papers (2026-01-13T07:46:46Z) - MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception. High computational cost limits adoption on resource-constrained platforms. We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z) - SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment [76.60024640625478]
Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps. We propose a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our method maintains high-quality video generation while substantially reducing the number of inference steps.
arXiv Detail & Related papers (2025-08-08T07:26:34Z) - Taming Diffusion Transformer for Efficient Mobile Video Generation in Seconds [91.56929670753226]
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones. We propose a series of novel optimizations to significantly accelerate video generation and enable practical deployment on mobile platforms.
arXiv Detail & Related papers (2025-07-17T17:59:10Z) - Improving Progressive Generation with Decomposable Flow Matching [50.63174319509629]
Decomposable Flow Matching (DFM) is a simple and effective framework for the progressive generation of visual media. On ImageNet-1k 512px, DFM achieves 35.2% improvements in FDD scores over the base architecture and 26.4% over the best-performing baseline.
arXiv Detail & Related papers (2025-06-24T17:58:02Z) - On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices [3.034710104407876]
We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device.
arXiv Detail & Related papers (2025-03-31T07:19:09Z) - Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation [49.202383675543466]
We present Acc3D to tackle the challenge of accelerating the diffusion process to generate 3D models from single images. To derive high-quality reconstructions through few-step inferences, we emphasize the critical issue of regularizing the learning of the score function in states of random noise.
arXiv Detail & Related papers (2025-03-20T09:18:10Z) - E2ED^2: Direct Mapping from Noise to Data for Enhanced Diffusion Models [15.270657838960114]
Diffusion models have established themselves as the de facto primary paradigm in visual generative modeling. We present a novel end-to-end learning paradigm that establishes direct optimization from the final generated samples to initial noises. Our method achieves substantial performance gains in terms of Fréchet Inception Distance (FID) and CLIP score, even with fewer sampling steps.
arXiv Detail & Related papers (2024-12-30T16:06:31Z) - FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin [32.172269679513285]
FlashOCC consolidates rapid and memory-efficient occupancy prediction.
Channel-to-height transformation is introduced to lift the output logits from the BEV into the 3D space.
Results substantiate the superiority of our plug-and-play paradigm over previous state-of-the-art methods.
arXiv Detail & Related papers (2023-11-18T15:28:09Z) - Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach of independently mapping text and vision into a joint embedding space, a.k.a. dual encoders, is attractive because retrieval scales well.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.