Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention
- URL: http://arxiv.org/abs/2510.16325v1
- Date: Sat, 18 Oct 2025 03:15:26 GMT
- Title: Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention
- Authors: Yuyao Zhang, Yu-Wing Tai
- Abstract summary: Scale-DiT is a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance. A lightweight LoRA adaptation bridges global and local pathways during denoising, ensuring consistency across structure and detail. Experiments demonstrate that Scale-DiT achieves more than $2\times$ faster inference and lower memory usage.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ultra-high-resolution text-to-image generation demands both fine-grained texture synthesis and globally coherent structure, yet current diffusion models remain constrained to sub-$1K \times 1K$ resolutions due to the prohibitive quadratic complexity of attention and the scarcity of native $4K$ training data. We present \textbf{Scale-DiT}, a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance, enabling efficient, scalable, and semantically coherent image synthesis at ultra-high resolutions. Specifically, high-resolution latents are divided into fixed-size local windows to reduce attention complexity from quadratic to near-linear, while a low-resolution latent equipped with scaled positional anchors injects global semantics. A lightweight LoRA adaptation bridges global and local pathways during denoising, ensuring consistency across structure and detail. To maximize inference efficiency, we reorder the token sequence along a Hilbert curve and implement a fused kernel that skips masked operations, resulting in a GPU-friendly design. Extensive experiments demonstrate that Scale-DiT achieves more than $2\times$ faster inference and lower memory usage compared to dense attention baselines, while reliably scaling to $4K \times 4K$ resolution without requiring additional high-resolution training data. On both quantitative benchmarks (FID, IS, CLIP Score) and qualitative comparisons, Scale-DiT delivers superior global coherence and sharper local detail, matching or outperforming state-of-the-art methods that rely on native 4K training. Taken together, these results highlight hierarchical local attention with guided low-resolution anchors as a promising and effective approach for advancing ultra-high-resolution image generation.
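The two core mechanisms named in the abstract, fixed-size local attention windows and Hilbert-curve token ordering, can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's implementation: the function names, the single-head attention without QKV projections, and the square-grid assumption are ours.

```python
import numpy as np

def hilbert_index(order, x, y):
    """Map (x, y) on a 2^order x 2^order grid to its 1-D Hilbert distance
    (standard bit-manipulation algorithm). Nearby tokens in 2-D stay
    nearby in the 1-D sequence, which keeps local windows contiguous."""
    d, s = 0, 1 << (order - 1)
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def window_attention(x, window):
    """Attention restricted to non-overlapping window x window blocks of an
    H x W token grid, reducing cost from O((HW)^2) to O(HW * window^2)."""
    hw, c = x.shape
    h = w = int(hw ** 0.5)               # assume a square token grid
    t = x.reshape(h // window, window, w // window, window, c)
    t = t.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)
    scores = t @ t.transpose(0, 2, 1) / np.sqrt(c)
    scores = np.exp(scores - scores.max(-1, keepdims=True))
    attn = scores / scores.sum(-1, keepdims=True)   # per-window softmax
    out = attn @ t
    out = out.reshape(h // window, w // window, window, window, c)
    return out.transpose(0, 2, 1, 3, 4).reshape(hw, c)
```

A real implementation would add QKV projections, multiple heads, the low-resolution guidance pathway, and the fused masked kernel; the Hilbert ordering is what makes the windowed layout contiguous in GPU memory.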
Related papers
- UltraGen: High-Resolution Video Generation with Hierarchical Attention
UltraGen is a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. We show that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time.
arXiv Detail & Related papers (2025-10-21T16:23:21Z)
- Dual-Stage Global and Local Feature Framework for Image Dehazing
We propose a novel framework, termed the Streamlined Global and Local Features Combinator (SGLC). Our approach is composed of two principal components: the Global Features Generator (GFG) and the Local Features Enhancer (LFE). Experimental results on high-resolution datasets reveal a considerable improvement in peak signal-to-noise ratio (PSNR) when employing SGLC.
arXiv Detail & Related papers (2025-08-28T09:03:48Z)
- Deep Equilibrium Convolutional Sparse Coding for Hyperspectral Image Denoising
Hyperspectral images (HSIs) play a crucial role in remote sensing but are often degraded by complex noise patterns. Ensuring the physical properties of the denoised HSIs is vital for robust HSI denoising, giving rise to deep unfolding-based methods. We propose a Deep Equilibrium Convolutional Sparse Coding (DECSC) framework that unifies local spatial-spectral correlations, nonlocal spatial self-similarities, and global spatial consistency.
arXiv Detail & Related papers (2025-08-21T13:35:11Z)
- Minimal High-Resolution Patches Are Sufficient for Whole Slide Image Representation via Cascaded Dual-Scale Reconstruction
Whole-slide image (WSI) analysis remains challenging due to gigapixel scale and sparsely distributed diagnostic regions. We propose a Cascaded Dual-Scale Reconstruction framework, demonstrating that an average of only 9 high-resolution patches per WSI is sufficient for robust slide-level representation.
arXiv Detail & Related papers (2025-08-03T08:01:30Z)
- A Global-Local Cross-Attention Network for Ultra-high Resolution Remote Sensing Image Semantic Segmentation
GLCANet is a lightweight segmentation framework designed for UHR remote sensing imagery. A self-attention mechanism enhances long-range dependencies, refines global features, and preserves local details for better semantic consistency. A masked cross-attention mechanism also adaptively fuses global-local features, selectively enhancing fine-grained details while exploiting global context to improve segmentation accuracy.
arXiv Detail & Related papers (2025-06-24T08:20:08Z)
- C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales
We propose a novel framework, C2D-ISR, for optimizing attention-based image super-resolution models. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. In addition, we generalize the hierarchical encoding mechanism to existing attention-based network structures.
arXiv Detail & Related papers (2025-03-17T21:52:18Z)
- HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
HiPrompt is a tuning-free solution for higher-resolution image generation.
Hierarchical prompts offer both global and local guidance.
Generated images maintain coherent local and global semantics, structures, and textures with high definition.
arXiv Detail & Related papers (2024-09-04T17:58:08Z)
- Low-Resolution Self-Attention for Semantic Segmentation
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost. Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution. We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure.
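The mechanism this summary describes, attending in a fixed low-resolution space whatever the input size, can be sketched in a few lines. This is a minimal NumPy sketch under our own assumptions (average pooling, nearest-neighbour upsampling, single head, no projections); it is not the LRFormer code.

```python
import numpy as np

def low_res_self_attention(x, lo=4):
    # Pool the H x W x C feature map to a fixed lo x lo grid, run plain
    # self-attention there, then upsample back. The attention cost depends
    # only on `lo`, not on the input resolution.
    h, w, c = x.shape
    ph, pw = h // lo, w // lo
    pooled = x.reshape(lo, ph, lo, pw, c).mean(axis=(1, 3)).reshape(-1, c)
    scores = pooled @ pooled.T / np.sqrt(c)
    scores = np.exp(scores - scores.max(-1, keepdims=True))
    attn = scores / scores.sum(-1, keepdims=True)          # softmax
    out = (attn @ pooled).reshape(lo, lo, c)
    return out.repeat(ph, axis=0).repeat(pw, axis=1)       # nearest upsample
```

Doubling the input resolution leaves the attention matrix at lo² x lo² entries, which is the point of the technique.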
arXiv Detail & Related papers (2023-10-08T06:10:09Z)
- Any-resolution Training for High-resolution Image Synthesis
Generative models operate at fixed resolution, even though natural images come in a variety of sizes.
We argue that every pixel matters and create datasets with variable-size images, collected at their native resolutions.
We introduce continuous-scale training, a process that samples patches at random scales to train a new generator with variable output resolutions.
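The patch-sampling step described here can be sketched as follows; the function name, patch size, and scale range are illustrative assumptions on our part, not the paper's API.

```python
import random

def sample_patch_box(img_h, img_w, patch=64, min_scale=0.25, max_scale=1.0):
    # Continuous-scale training (sketch): draw a random scale factor,
    # resize the image virtually by that factor, and crop a fixed-size
    # patch, so one generator sees the same patch size at many
    # effective output resolutions.
    s = random.uniform(min_scale, max_scale)
    h = max(patch, round(img_h * s))
    w = max(patch, round(img_w * s))
    top = random.randint(0, h - patch)
    left = random.randint(0, w - patch)
    return s, (top, left, top + patch, left + patch)
```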
arXiv Detail & Related papers (2022-04-14T17:59:31Z)
- Low Light Image Enhancement via Global and Local Context Modeling
We introduce a context-aware deep network for low-light image enhancement.
First, it features a global context module that models spatial correlations to find complementary cues over the full spatial domain.
Second, it introduces a dense residual block that captures local context with a relatively large receptive field.
arXiv Detail & Related papers (2021-01-04T09:40:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.