Related papers: Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers

URL: http://arxiv.org/abs/2507.08422v1
Date: Fri, 11 Jul 2025 09:07:43 GMT
Title: Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
Authors: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun
Abstract summary: Region-Adaptive Latent Upsampling (RALU) is a training-free framework that accelerates inference along spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution denoising latent diffusion to efficiently capture global semantic structure, 2) region-adaptive upsampling on specific regions prone to artifacts at full-resolution, and 3) all latent upsampling at full-resolution for detail refinement. Our method significantly reduces computation while preserving image quality by achieving up to 7.0$\times$ speed-up on FLUX and 3.0$\times$ on Stable Diffusion 3 with minimal degradation.
Score: 9.875073051988057
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Diffusion transformers have emerged as an alternative to U-net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension such as reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution denoising latent diffusion to efficiently capture global semantic structure, 2) region-adaptive upsampling on specific regions prone to artifacts at full-resolution, and 3) all latent upsampling at full-resolution for detail refinement. To stabilize generations across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. Our method significantly reduces computation while preserving image quality by achieving up to 7.0$\times$ speed-up on FLUX and 3.0$\times$ on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, thus can be seamlessly integrated to further reduce inference latency without compromising generation quality.
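To make the three-stage idea in the abstract concrete, the following is a toy token-count sketch of mixed-resolution sampling. It is not the authors' implementation: the grid size, downsampling factor, steps per stage, the border-band mask, and the helper names `mixed_resolution_tokens` and `relative_cost` are all assumptions chosen for illustration, and the cost model (cost proportional to a power of the token count) is a deliberate simplification.

```python
# Illustrative only: a toy token-count model of three-stage mixed-resolution
# sampling. All numbers and names are assumptions, not values from the paper.


def mixed_resolution_tokens(full_h, full_w, factor, upsample_mask):
    """Count tokens when only the masked low-resolution cells are upsampled.

    upsample_mask is a (full_h // factor) x (full_w // factor) grid of booleans.
    A True cell is replaced by factor * factor full-resolution tokens;
    a False cell stays as a single low-resolution token.
    """
    low_h, low_w = full_h // factor, full_w // factor
    assert len(upsample_mask) == low_h
    assert all(len(row) == low_w for row in upsample_mask)
    selected = sum(cell for row in upsample_mask for cell in row)
    return selected * factor * factor + (low_h * low_w - selected)


def relative_cost(stage_tokens, stage_steps, full_tokens, exponent=1.0):
    """Cost of a staged schedule relative to running every step at full resolution.

    exponent=1.0 treats cost as linear in token count; exponent=2.0 mimics an
    attention-dominated regime. Both are rough simplifications.
    """
    total_steps = sum(stage_steps)
    staged = sum(s * t ** exponent for t, s in zip(stage_tokens, stage_steps))
    return staged / (total_steps * full_tokens ** exponent)


if __name__ == "__main__":
    full_h = full_w = 64   # hypothetical full-resolution latent grid
    factor = 2             # hypothetical downsampling factor
    low_h, low_w = full_h // factor, full_w // factor

    # Hypothetical mask: upsample only a border band of the low-resolution grid,
    # standing in for "regions prone to artifacts" (the real criterion is not
    # specified in the abstract).
    mask = [
        [r < 4 or r >= low_h - 4 or c < 4 or c >= low_w - 4 for c in range(low_w)]
        for r in range(low_h)
    ]

    stage1 = low_h * low_w                                        # all low-resolution
    stage2 = mixed_resolution_tokens(full_h, full_w, factor, mask)  # mixed
    stage3 = full_h * full_w                                      # all full-resolution
    steps = [10, 10, 10]   # hypothetical number of denoising steps per stage

    print("tokens per stage:", stage1, stage2, stage3)
    print("relative cost (linear):", relative_cost([stage1, stage2, stage3], steps, stage3))
    print("relative cost (quadratic):", relative_cost([stage1, stage2, stage3], steps, stage3, 2.0))
```

Under this toy model, the saving comes from the first two stages processing fewer tokens than a full-resolution step would. The sketch says nothing about image quality or about the noise-timestep rescheduling the abstract mentions for resolution transitions.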

Related papers

From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution [11.05647700476321]
Diffusion Transformers achieve impressive generative quality but remain expensive due to iterative sampling. We propose Fresco, a dynamic resolution framework that unifies re-noise and global structure across stages with progressive upsampling.
arXiv Detail & Related papers (2026-01-12T12:15:30Z)
BADiff: Bandwidth Adaptive Diffusion Model [55.10134744772338]
Traditional diffusion models produce high-fidelity images by performing a fixed number of denoising steps, regardless of downstream transmission limitations. In practical cloud-to-device scenarios, limited bandwidth often necessitates heavy compression, leading to loss of fine textures and wasted computation. We introduce a joint end-to-end training strategy where the diffusion model is conditioned on a target quality level derived from the available bandwidth.
arXiv Detail & Related papers (2025-10-24T11:50:03Z)
RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer [86.57077884971478]
Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. We introduce RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformers. It delivers image-wise acceleration with zero updates to the base generator. It achieves nearly 3x faster sampling with competitive generation quality.
arXiv Detail & Related papers (2025-09-26T13:20:52Z)
Single-Step Latent Consistency Model for Remote Sensing Image Super-Resolution [7.920423405957888]
We propose a novel single-step diffusion approach designed to enhance both efficiency and visual quality in RSISR tasks. The proposed LCMSR reduces the iterative steps of traditional diffusion models from 50-1000 or more to just a single step. Experimental results demonstrate that LCMSR effectively balances efficiency and performance, achieving inference times comparable to non-diffusion models.
arXiv Detail & Related papers (2025-03-25T09:56:21Z)
Training-free Diffusion Acceleration with Bottleneck Sampling [37.9135035506567]
Bottleneck Sampling is a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. It accelerates inference by up to 3$\times$ for image generation and 2.5$\times$ for video generation, all while maintaining output quality comparable to the standard full-resolution sampling process.
arXiv Detail & Related papers (2025-03-24T17:59:02Z)
Region-Adaptive Sampling for Diffusion Transformers [23.404921023113324]
RAS dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality.
arXiv Detail & Related papers (2025-02-14T18:59:36Z)
Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution [151.1255837803585]
We propose a novel approach, pursuing Spatial Adaptation and Temporal Coherence (SATeCo) for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-03-25T17:59:26Z)
Efficient Diffusion Model for Image Restoration by Residual Shifting [63.02725947015132]
This study proposes a novel and efficient diffusion model for image restoration. Our method avoids the need for post-acceleration during inference, thereby avoiding the associated performance deterioration. Our method achieves superior or comparable performance to current state-of-the-art methods on three classical IR tasks.
arXiv Detail & Related papers (2024-03-12T05:06:07Z)
StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D [88.66678730537777]
We present StableDreamer, a methodology incorporating three advances. First, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss. Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition.
arXiv Detail & Related papers (2023-12-02T02:27:58Z)
DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models [58.450152413700586]
We introduce a soft absorbing state that facilitates the diffusion model in learning to reconstruct discrete mutations based on the underlying Gaussian space. We employ state-of-the-art ODE solvers within the continuous space to expedite the sampling process. Our proposed method effectively accelerates the training convergence by 4x and generates samples of similar quality 800x faster.
arXiv Detail & Related papers (2023-10-09T15:29:10Z)
ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting [70.83632337581034]
Diffusion-based image super-resolution (SR) methods are mainly limited by the low inference speed. We propose a novel and efficient diffusion model for SR that significantly reduces the number of diffusion steps. Our method constructs a Markov chain that transfers between the high-resolution image and the low-resolution image by shifting the residual.
arXiv Detail & Related papers (2023-07-23T15:10:02Z)
Towards Interpretable Video Super-Resolution via Alternating Optimization [115.85296325037565]
We study a practical space-time video super-resolution (STVSR) problem which aims at generating a high-framerate high-resolution sharp video from a low-framerate blurry video. We propose an interpretable STVSR framework by leveraging both model-based and learning-based methods.
arXiv Detail & Related papers (2022-07-21T21:34:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.