Region-Adaptive Sampling for Diffusion Transformers
- URL: http://arxiv.org/abs/2502.10389v1
- Date: Fri, 14 Feb 2025 18:59:36 GMT
- Title: Region-Adaptive Sampling for Diffusion Transformers
- Authors: Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang
- Abstract summary: RAS dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality.
- Score: 23.404921023113324
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, and fail to exploit variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling a variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined from the output of the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups of up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable quality under human evaluation while achieving a 1.6x speedup. Our approach marks a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.
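As an illustration of the mechanism described in the abstract, the sketch below recomputes the DiT output only for tokens judged to be in focus and reuses the cached output from the previous step everywhere else. This is a minimal, hypothetical example: the stand-in `model`, the norm-based focus heuristic, the Euler-style update, and the 50-step schedule are assumptions made for illustration, not the paper's implementation.

```python
# Illustrative sketch of region-adaptive sampling (RAS-style); not the authors' code.
# "model", the focus-selection heuristic, and the update rule are assumptions.
import torch

def ras_step(model, x_tokens, t, cached_out, keep_ratio=0.5):
    """One denoising step that recomputes only the currently focused tokens.

    x_tokens:   (N, D) latent tokens at the current step
    cached_out: (N, D) model output from the previous step (reused for idle tokens)
    keep_ratio: fraction of tokens treated as the model's focus this step
    """
    n_tokens = x_tokens.shape[0]
    n_keep = max(1, int(keep_ratio * n_tokens))

    # Focus heuristic (assumption): tokens with the largest previous output are
    # treated as the semantically active regions for this step.
    importance = cached_out.norm(dim=-1)
    focus_idx = torch.topk(importance, n_keep).indices

    # Recompute the model output only for the focused tokens; DiTs tolerate a
    # variable token count, which is what makes this partial pass possible.
    new_out = cached_out.clone()
    new_out[focus_idx] = model(x_tokens[focus_idx], t)

    # Euler-style update using the mixed fresh/cached output.
    dt = -1.0 / 50  # placeholder step size for a 50-step schedule
    x_next = x_tokens + dt * new_out
    return x_next, new_out

# Toy usage with a stand-in "model" so the sketch runs end to end.
if __name__ == "__main__":
    dummy_model = lambda tokens, t: torch.randn_like(tokens)
    x = torch.randn(1024, 64)       # 1024 latent tokens, 64-dim each
    cache = dummy_model(x, 0)       # first step computes all tokens
    for step in range(1, 10):
        x, cache = ras_step(dummy_model, x, step, cache, keep_ratio=0.5)
```

The point the sketch tries to capture is that the focus set is derived from the previous step's output (here via a simple per-token norm), which is what allows the expensive transformer pass to run on only a subset of tokens at each step.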
Related papers
- Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers [9.875073051988057]
Region-Adaptive Latent Upsampling (RALU) is a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution latent denoising to efficiently capture the global semantic structure, 2) region-adaptive upsampling of specific regions prone to artifacts at full resolution, and 3) all-latent upsampling at full resolution for detail refinement. Our method significantly reduces computation while preserving image quality, achieving up to 7.0x speed-up on FLUX and 3.0x on Stable Diffusion 3 with minimal degradation.
arXiv Detail & Related papers (2025-07-11T09:07:43Z) - Single-Step Latent Consistency Model for Remote Sensing Image Super-Resolution [7.920423405957888]
We propose a novel single-step diffusion approach designed to enhance both efficiency and visual quality in RSISR tasks.
The proposed LCMSR reduces the iterative steps of traditional diffusion models from 50-1000 or more to just a single step.
Experimental results demonstrate that LCMSR effectively balances efficiency and performance, achieving inference times comparable to non-diffusion models.
arXiv Detail & Related papers (2025-03-25T09:56:21Z) - Sequential Controlled Langevin Diffusions [80.93988625183485]
Two popular methods are (1) Sequential Monte Carlo (SMC), where the transport is performed through successive densities via prescribed Markov chains and resampling steps, and (2) recently developed diffusion-based sampling methods, where a learned dynamical transport is used. We present a principled framework for combining SMC with diffusion-based samplers by viewing both methods in continuous time and considering measures on path space. This culminates in the new Sequential Controlled Langevin Diffusion (SCLD) sampling method, which is able to utilize the benefits of both methods and reaches improved performance on multiple benchmark problems, in many cases using only 10% of the training budget of previous diffusion-based samplers.
arXiv Detail & Related papers (2024-12-10T00:47:10Z) - Effective Diffusion Transformer Architecture for Image Super-Resolution [63.254644431016345]
We design an effective diffusion transformer for image super-resolution (DiT-SR).
In practice, DiT-SR leverages an overall U-shaped architecture, and adopts a uniform isotropic design for all the transformer blocks.
We analyze the limitation of the widely used AdaLN, and present a frequency-adaptive time-step conditioning module.
arXiv Detail & Related papers (2024-09-29T07:14:16Z) - Amortized Posterior Sampling with Diffusion Prior Distillation [55.03585818289934]
Amortized Posterior Sampling is a novel variational inference approach for efficient posterior sampling in inverse problems. Our method trains a conditional flow model to minimize the divergence between the variational distribution and the posterior distribution implicitly defined by the diffusion model. Unlike existing methods, our approach is unsupervised, requires no paired training data, and is applicable to both Euclidean and non-Euclidean domains.
arXiv Detail & Related papers (2024-07-25T09:53:12Z) - DDLNet: Boosting Remote Sensing Change Detection with Dual-Domain Learning [5.932234366793244]
Remote sensing change detection (RSCD) aims to identify the changes of interest in a region by analyzing multi-temporal remote sensing images.
Existing RSCD methods are devoted to contextual modeling in the spatial domain to enhance the changes of interest.
We propose DDLNet, an RSCD network based on dual-domain learning (i.e., frequency and spatial domains).
arXiv Detail & Related papers (2024-06-19T14:54:09Z) - Self-Supervised Modality-Agnostic Pre-Training of Swin Transformers [0.7496510641958004]
We augment the Swin Transformer to learn from different medical imaging modalities, enhancing downstream performance.
Our model, dubbed SwinFUSE, offers three key advantages: (i) it learns from both Computed Tomography (CT) and Magnetic Resonance Images (MRI) during pre-training, resulting in complementary feature representations; (ii) it includes a domain-invariance module (DIM) that effectively highlights salient input regions, enhancing adaptability; (iii) it exhibits remarkable generalizability, surpassing the confines of the tasks it was initially pre-trained on.
arXiv Detail & Related papers (2024-05-21T13:28:32Z) - StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D [88.66678730537777]
We present StableDreamer, a methodology incorporating three advances.
First, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss; a generic form of this identity is sketched below.
Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition.
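For reference, one standard way to write such an equivalence (notation ours; this is a sketch of the generic identity rather than the paper's exact formulation): with a rendered latent $z$, noised sample $z_t = \alpha_t z + \sigma_t \epsilon$, and noise prediction $\epsilon_\phi$, the SDS gradient matches the gradient of a weighted L2 reconstruction loss against the stop-gradient one-step denoised estimate $\hat{z}_0$:

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
= \mathbb{E}_{t,\epsilon}\!\left[w(t)\,\bigl(\epsilon_\phi(z_t; y, t)-\epsilon\bigr)\,\tfrac{\partial z}{\partial \theta}\right]
= \nabla_\theta\,\mathbb{E}_{t,\epsilon}\!\left[\tfrac{\alpha_t\,w(t)}{2\,\sigma_t}\,\bigl\|z-\mathrm{sg}\bigl(\hat{z}_0(z_t; y, t)\bigr)\bigr\|^2\right],
\qquad
\hat{z}_0 = \frac{z_t-\sigma_t\,\epsilon_\phi(z_t; y, t)}{\alpha_t},
$$

which follows from $z-\hat{z}_0 = \tfrac{\sigma_t}{\alpha_t}\bigl(\epsilon_\phi - \epsilon\bigr)$.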
arXiv Detail & Related papers (2023-12-02T02:27:58Z) - Generative Modeling with Phase Stochastic Bridges [49.4474628881673]
Diffusion models (DMs) represent state-of-the-art generative models for continuous inputs.
We introduce a novel generative modeling framework grounded in phase space dynamics.
Our framework demonstrates the capability to generate realistic data points at an early stage of dynamics propagation.
arXiv Detail & Related papers (2023-10-11T18:38:28Z) - DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for
Accelerated Seq2Seq Diffusion Models [58.450152413700586]
We introduce a soft absorbing state that facilitates the diffusion model in learning to reconstruct discrete mutations based on the underlying Gaussian space.
We employ state-of-the-art ODE solvers within the continuous space to expedite the sampling process.
Our proposed method effectively accelerates the training convergence by 4x and generates samples of similar quality 800x faster.
arXiv Detail & Related papers (2023-10-09T15:29:10Z) - GCD-DDPM: A Generative Change Detection Model Based on
Difference-Feature Guided DDPM [7.922421805234563]
Deep learning methods have recently shown great promise in bitemporal change detection (CD).
This work proposes a generative change detection model called GCD-DDPM to directly generate CD maps.
Experiments on four high-resolution CD datasets confirm the superior performance of the proposed GCD-DDPM.
arXiv Detail & Related papers (2023-06-06T05:51:50Z) - Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z) - CDN-MEDAL: Two-stage Density and Difference Approximation Framework for
Motion Analysis [3.337126420148156]
We propose a novel, two-stage method of change detection with two convolutional neural networks.
Our two-stage framework contains approximately 3.5K parameters in total but still maintains rapid convergence to intricate motion patterns.
arXiv Detail & Related papers (2021-06-07T16:39:42Z) - Lipreading using Temporal Convolutional Networks [57.41253104365274]
The current model for recognition of isolated words in-the-wild consists of a residual network and Bidirectional Gated Recurrent Unit layers.
We address the limitations of this model and we propose changes which further improve its performance.
Our proposed model results in an absolute improvement of 1.2% and 3.2%, respectively, in these datasets.
arXiv Detail & Related papers (2020-01-23T17:49:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.