Related papers: One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer

URL: http://arxiv.org/abs/2511.19778v1
Date: Mon, 24 Nov 2025 23:10:15 GMT
Title: One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer
Authors: Haoyu Wu, Jingyi Xu, Qiaomu Miao, Dimitris Samaras, Hieu Le
Abstract summary: Cross-Resolution Phase-Aligned Attention (CRPA) is a training-free drop-in fix that eliminates attention collapse in mixed-resolution denoising at its source. CRPA is fully compatible with pretrained DiTs and stabilizes all heads and layers uniformly. We demonstrate that CRPA enables high-fidelity and efficient mixed-resolution generation, outperforming previous state-of-the-art methods on image and video generation.
Score: 48.30024190686566
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We identify a core failure mode that occurs when using the usual linear interpolation on rotary positional embeddings (RoPE) for mixed-resolution denoising with Diffusion Transformers. When tokens from different spatial grids are mixed, the attention mechanism collapses. The issue is structural. Linear coordinate remapping forces a single attention head to compare RoPE phases sampled at incompatible rates, creating phase aliasing that destabilizes the score landscape. Pretrained DiTs are especially brittle: many heads exhibit extremely sharp, periodic phase selectivity, so even tiny cross-rate inconsistencies reliably cause blur, artifacts, or full collapse. To address this, our main contribution is Cross-Resolution Phase-Aligned Attention (CRPA), a training-free drop-in fix that eliminates this failure at its source. CRPA modifies only the RoPE index map for each attention call: all Q/K positions are expressed on the query's stride so that equal physical distances always induce identical phase increments. This restores the precise phase patterns that DiTs rely on. CRPA is fully compatible with pretrained DiTs and stabilizes all heads and layers uniformly. We demonstrate that CRPA enables high-fidelity and efficient mixed-resolution generation, outperforming previous state-of-the-art methods on image and video generation.
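
The index map described in the abstract can be illustrated with a minimal 1D sketch. This is one possible reading of "all Q/K positions are expressed on the query's stride", not the authors' implementation. The helper names (`rope_angles`, `index_on_query_stride`), the token layout, and the choice of `dim` and `base` are all assumptions made for illustration.

```python
import numpy as np

def rope_angles(index, dim=8, base=10000.0):
    """Standard RoPE angles: index * base**(-2i/dim) for each channel pair i."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(np.atleast_1d(index), inv_freq)

def index_on_query_stride(physical_pos, query_stride):
    """Hypothetical helper: express a physical coordinate in units of the query's stride."""
    return np.asarray(physical_pos, dtype=float) / query_stride

# Assumed layout: a fine region sampled every 1 unit, a coarse region every 2 units.
fine_pos = np.arange(0, 8, 1.0)      # 8 tokens, stride 1
coarse_pos = np.arange(8, 16, 2.0)   # 4 tokens, stride 2

# The query sits on the fine grid, so every key is indexed in fine-grid units.
query_pos, query_stride = fine_pos[4], 1.0
q_idx = index_on_query_stride(query_pos, query_stride)
k_idx = index_on_query_stride(np.concatenate([fine_pos, coarse_pos]), query_stride)

# Relative phase seen by the attention head for each key (shape: keys x channel pairs).
rel_phase = rope_angles(k_idx) - rope_angles(q_idx)
```

In this sketch the relative phase depends only on the physical distance between query and key, measured in query strides, whichever grid the key came from. For a query on the coarse grid, the same helper would be called with `query_stride=2.0`.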

Related papers

From Circuits to Dynamics: Understanding and Stabilizing Failure in 3D Diffusion Transformers [25.11520870904882]
3D diffusion transformers exhibit a catastrophic mode of failure. We call this phenomenon Meltdown. We introduce PowerRemap, a test-time control that stabilizes sparse point-cloud conditioning.
arXiv Detail & Related papers (2026-02-11T18:42:05Z)
Rotary Positional Embeddings as Phase Modulation: Theoretical Bounds on the RoPE Base for Long-Context Transformers [0.5414847001704249]
Rotary positional embeddings (RoPE) are widely used in large language models to encode token positions. We derive principled lower bounds on the RoPE base parameter that are necessary to preserve positional coherence over a target context length. We extend this analysis to deep transformers, showing that repeated rotary modulation across layers compounds angular misalignment. Together, the lower and upper bounds define a precision- and depth-dependent feasibility region, a Goldilocks zone, for long-context transformers.
arXiv Detail & Related papers (2026-02-11T15:50:07Z)
Unifying Heterogeneous Degradations: Uncertainty-Aware Diffusion Bridge Model for All-in-One Image Restoration [39.5698877093219]
We propose an Uncertainty-Aware Diffusion Bridge Model (UDBM) for image restoration. UDBM reformulates AiOIR as a transport problem steered by pixel-wise uncertainty. It achieves state-of-the-art performance across diverse restoration tasks within a single inference step.
arXiv Detail & Related papers (2026-01-29T12:02:42Z)
Universal composite phase gates with tunable target phase [0.0]
We present a systematic method for constructing universal composite phase gates with a continuously tunable target phase. Numerical simulations in a standard two-level model confirm high-order error suppression and demonstrate broad, flat high-fidelity plateaus over wide ranges of simultaneous pulse-area and detuning errors.
arXiv Detail & Related papers (2026-01-20T12:53:05Z)
NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation [88.09231548061295]
Phase-Preserving Diffusion (φ-PD) is a model-agnostic reformulation of the diffusion process. φ-PD preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos.
arXiv Detail & Related papers (2025-12-04T18:59:18Z)
PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs [57.790910044227935]
Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We present Phase Aggregated Smoothing (PAS), a training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. Our analysis shows that the RoPE-rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; multi-phase averaging attenuates high-frequency ripples while preserving per-head spectra under Nyquist-valid sampling.
arXiv Detail & Related papers (2025-11-14T05:56:47Z)
Morphing Through Time: Diffusion-Based Bridging of Temporal Gaps for Robust Alignment in Change Detection [51.56484100374058]
We introduce a modular pipeline that improves spatial and temporal robustness without altering existing change detection networks. A diffusion module synthesizes intermediate morphing frames that bridge large appearance gaps, enabling RoMa to estimate stepwise correspondences. Experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show consistent gains in both registration accuracy and downstream change detection.
arXiv Detail & Related papers (2025-11-11T08:40:28Z)
Holographic Transformers for Complex-Valued Signal Processing: Integrating Phase Interference into Self-Attention [19.574464511943074]
We introduce the Holographic Transformer, a physics-inspired architecture that incorporates wave interference principles into self-attention. A dual-headed decoder simultaneously reconstructs the input and predicts task outputs, preventing phase collapse when losses prioritize magnitude over phase. Experiments on PolSAR image classification and wireless channel prediction show strong performance, achieving high classification accuracy and F1 scores, low regression error, and increased robustness to phase perturbations.
arXiv Detail & Related papers (2025-09-14T15:24:43Z)
Semi-Supervised Coupled Thin-Plate Spline Model for Rotation Correction and Beyond [84.56978780892783]
We propose CoupledTPS, which iteratively couples multiple TPS with limited control points into a more flexible and powerful transformation. In light of the laborious annotation cost, we develop a semi-supervised learning scheme to improve warping quality by exploiting unlabeled data. Experiments demonstrate the superiority and universality of CoupledTPS over the existing state-of-the-art solutions for rotation correction.
arXiv Detail & Related papers (2024-01-24T13:03:28Z)
Adaptive Multi-step Refinement Network for Robust Point Cloud Registration [82.64560249066734]
Point Cloud Registration estimates the relative rigid transformation between two point clouds of the same scene. We propose an adaptive multi-step refinement network that refines the registration quality at each step by leveraging the information from the preceding step. Our method achieves state-of-the-art performance on both the 3DMatch/3DLoMatch and KITTI benchmarks.
arXiv Detail & Related papers (2023-12-05T18:59:41Z)
Improving Misaligned Multi-modality Image Fusion with One-stage Progressive Dense Registration [67.23451452670282]
Misalignments between multi-modality images pose challenges in image fusion. We propose a Cross-modality Multi-scale Progressive Dense Registration scheme. This scheme accomplishes the coarse-to-fine registration exclusively using a one-stage optimization.
arXiv Detail & Related papers (2023-08-22T03:46:24Z)
Rotation-Invariant Transformer for Point Cloud Matching [42.5714375149213]
We introduce RoITr, a Rotation-Invariant Transformer to cope with the pose variations in the point cloud matching task. We propose a global transformer with rotation-invariant cross-frame spatial awareness learned by the self-attention mechanism. RoITr surpasses the existing methods by at least 13 and 5 percentage points in terms of Inlier Ratio and Registration Recall.
arXiv Detail & Related papers (2023-03-14T20:55:27Z)
Is Perfect Filtering Enough Leading to Perfect Phase Correction for dMRI data? [0.0]
We argue that even a perfect filter is insufficient for phase correction because the correction procedures are incapable of distinguishing sign-symbols of noise. We propose a calibration procedure that could conveniently distinguish noise sign symbols.
arXiv Detail & Related papers (2021-06-13T13:38:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.