Related papers: SDiT: Semantic Region-Adaptive for Diffusion Transformers

URL: http://arxiv.org/abs/2601.12283v1
Date: Sun, 18 Jan 2026 06:43:36 GMT
Title: SDiT: Semantic Region-Adaptive for Diffusion Transformers
Authors: Bowen Lin, Fanjiang Ye, Yihua Liu, Zhenghui Guo, Boyuan Zhang, Weijian Zheng, Yufan Xu, Tiancheng Xing, Yuke Wang, Chengming Zhang
Abstract summary: Diffusion Transformers (DiTs) achieve state-of-the-art performance in text-to-image synthesis but remain computationally expensive due to the iterative nature of denoising and the quadratic cost of global attention. We propose SDiT, a Semantic Region-Adaptive Diffusion Transformer that allocates computation according to regional complexity.
Score: 4.7254170106792035
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art performance in text-to-image synthesis but remain computationally expensive due to the iterative nature of denoising and the quadratic cost of global attention. In this work, we observe that denoising dynamics are spatially non-uniform: background regions converge rapidly, while edges and textured areas evolve much more actively. Building on this insight, we propose SDiT, a Semantic Region-Adaptive Diffusion Transformer that allocates computation according to regional complexity. SDiT introduces a training-free framework combining (1) semantic-aware clustering via fast Quickshift-based segmentation, (2) complexity-driven regional scheduling to selectively update informative areas, and (3) boundary-aware refinement to maintain spatial coherence. Without any model retraining or architectural modification, SDiT achieves up to 3.0x acceleration while preserving nearly identical perceptual and semantic quality to full-attention inference.
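Illustrative sketch (not from the paper): the abstract does not specify how regional complexity is measured or how regions are selected, so the following is only one plausible reading of step (2). It assumes a precomputed region label map (for example, from a Quickshift-style segmentation), scores each region by the mean absolute change of its values between two consecutive denoising steps, and marks the most active fraction of regions for update. The function names, the `keep_ratio` parameter, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def region_complexity(prev, curr, labels):
    """Mean absolute change per region between two denoising steps.

    prev, curr: 2D lists of floats (toy stand-ins for latent values).
    labels:     2D list of region ids with the same shape.
    This change-based score is an assumption, not the paper's metric.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for row_prev, row_curr, row_lab in zip(prev, curr, labels):
        for p, c, lab in zip(row_prev, row_curr, row_lab):
            total[lab] += abs(c - p)
            count[lab] += 1
    return {lab: total[lab] / count[lab] for lab in total}


def select_active_regions(complexity, keep_ratio=0.5):
    """Return the ids of the most active regions (at least one)."""
    ranked = sorted(complexity, key=complexity.get, reverse=True)
    k = max(1, math.ceil(keep_ratio * len(ranked)))
    return set(ranked[:k])


def update_mask(labels, active):
    """Boolean mask: True where a position belongs to an active region."""
    return [[lab in active for lab in row] for row in labels]


# Toy example: region 0 is a nearly static "background",
# region 1 changes strongly, region 2 changes slightly.
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [2, 2, 2, 2]]
prev = [[0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.5, 0.5, 0.5, 0.5]]
curr = [[0.0, 0.01, 1.4, 0.7],
        [0.0, 0.01, 1.2, 0.9],
        [0.5, 0.5, 0.6, 0.5]]

scores = region_complexity(prev, curr, labels)
active = select_active_regions(scores, keep_ratio=0.5)
mask = update_mask(labels, active)
```

In this toy setup, only positions where `mask` is true would be recomputed at the next step, while the remaining positions would reuse their previous values. How SDiT actually schedules regions and refines boundaries is described in the paper itself, not here.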

Related papers

Fast-SAM3D: 3Dfy Anything in Images but Faster [65.17322167628367]
SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. We present Fast-SAM3D, a training-free framework that aligns computation with instantaneous generation complexity.
arXiv Detail & Related papers (2026-02-05T04:27:59Z)
Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding [86.55824709875598]
We propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes both semantic and rendering branches. Unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor to capture fine-grained 3D shape details. We employ a cross-scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations.
arXiv Detail & Related papers (2026-01-05T18:33:50Z)
Fourier-RWKV: A Multi-State Perception Network for Efficient Image Dehazing [26.57698394898644]
We propose a novel dehazing framework based on a Multi-State Perception paradigm. Fourier-RWKV delivers state-of-the-art performance across diverse haze scenarios.
arXiv Detail & Related papers (2025-12-09T01:35:56Z)
Adaptive Mesh-Quantization for Neural PDE Solvers [51.26961483962011]
Graph Neural Networks can handle the irregular meshes required for complex geometries and boundary conditions, but still apply uniform computational effort across all nodes. We propose Adaptive Mesh Quantization: spatially adaptive quantization across mesh node, edge, and cluster features, dynamically adjusting the bit-width used by a quantized model. We demonstrate our framework's effectiveness by integrating it with two state-of-the-art models, MP-PDE and GraphViT, to evaluate performance across multiple tasks.
arXiv Detail & Related papers (2025-11-23T14:47:24Z)
Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information. Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z)
Bidirectional Feature-aligned Motion Transformation for Efficient Dynamic Point Cloud Compression [97.66080040613726]
We propose a Bidirectional Feature-aligned Motion Transformation (Bi-FMT) framework that implicitly models motion in the feature space. Bi-FMT aligns features across both past and future frames to produce temporally consistent latent representations. We show Bi-FMT surpasses D-DPCC and AdaDPCC in both compression efficiency and runtime.
arXiv Detail & Related papers (2025-09-18T03:51:06Z)
Deep Equilibrium Convolutional Sparse Coding for Hyperspectral Image Denoising [16.405355853358202]
Hyperspectral images (HSIs) play a crucial role in remote sensing but are often degraded by complex noise patterns. Ensuring the physical property of the denoised HSIs is vital for robust HSI denoising, giving rise to deep unfolding-based methods. We propose a Deep Equilibrium Convolutional Sparse Coding (DECSC) framework that unifies local spatial-spectral correlations, nonlocal spatial self-similarities, and global spatial consistency.
arXiv Detail & Related papers (2025-08-21T13:35:11Z)
CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step [37.449561703903505]
CoT-Diff is a framework that brings step-by-step CoT-style reasoning into T2I generation. CoT-Diff tightly integrates Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity.
arXiv Detail & Related papers (2025-07-06T16:17:32Z)
TOAST: Task-Oriented Adaptive Semantic Transmission over Dynamic Wireless Environments [3.3107717550009865]
TOAST (Task-Oriented Adaptive Semantic Transmission) is a unified framework designed to address the core challenge of multi-task optimization in wireless environments. We formulate adaptive task balancing as a Markov decision process, employing deep reinforcement learning to dynamically adjust the trade-off between image reconstruction fidelity and semantic classification accuracy. We integrate module-specific Low-Rank Adaptation (LoRA) mechanisms throughout our Swin Transformer-based joint source-channel coding architecture.
arXiv Detail & Related papers (2025-06-27T04:36:30Z)
TMT: Cross-domain Semantic Segmentation with Region-adaptive Transferability Estimation [27.208145888390117]
We propose a region-adaptive framework designed to enhance cross-domain representation learning through transferability guidance. First, we dynamically partition the image into coherent regions, grouped by structural and semantic similarity, and estimate their domain transferability at a localized level. Then, we incorporate region-level transferability maps directly into the self-attention mechanism of ViTs, allowing the model to adaptively focus attention on areas with lower transferability and higher semantic uncertainty.
arXiv Detail & Related papers (2025-04-08T07:53:51Z)
ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion [28.38307253613529]
We propose a framework that integrates temporal-spatial and semantic consistency with Bilateral DDIM inversion. Our method significantly improves perceptual quality, text-image alignment, and temporal coherence, as demonstrated on the MSR-VTT dataset.
arXiv Detail & Related papers (2025-01-08T16:41:31Z)
VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis [73.50359502037232]
VoxNeRF is a novel approach to enhance the quality and efficiency of neural indoor reconstruction and novel view synthesis. We propose an efficient voxel-guided sampling technique that selectively allocates computational resources to the most relevant segments of rays. Our approach is validated with extensive experiments on ScanNet and ScanNet++.
arXiv Detail & Related papers (2023-11-09T11:32:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.