Related papers: Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation

URL: http://arxiv.org/abs/2510.10489v1
Date: Sun, 12 Oct 2025 07:46:28 GMT
Title: Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation
Authors: Jiaye Li, Baoyou Chen, Hui Li, Zilong Dong, Jingdong Wang, Siyu Zhu
Abstract summary: Rotary Position Embedding (RoPE) excels in 1D domains, but its application to image generation reveals significant limitations. HARoPE is a head-wise adaptive extension that inserts a learnable linear transformation parameterized via singular value decomposition. HARoPE consistently improves performance over strong RoPE baselines and other extensions.
Score: 35.66580960895196
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers rely on explicit positional encoding to model structure in data. While Rotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant limitations in areas such as fine-grained spatial relation modeling, color cues, and object counting. This paper identifies key limitations of standard multi-dimensional RoPE (rigid frequency allocation, axis-wise independence, and uniform head treatment) in capturing the complex structural biases required for fine-grained image generation. We propose HARoPE, a head-wise adaptive extension that inserts a learnable linear transformation parameterized via singular value decomposition (SVD) before the rotary mapping. This lightweight modification enables dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while rigorously preserving RoPE's relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPE consistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.
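
Illustration (not from the paper): the abstract's claim that a linear map inserted before the rotary mapping preserves the relative-position property can be checked numerically. The sketch below is a minimal, assumption-laden toy, not the authors' implementation. It uses 1D positions, a 4-dimensional head, and a hypothetical map A = U diag(s) V^T whose orthogonal factors are single Givens rotations. The names `svd_linear_map`, `rope`, and `score` are made up for this example.

```python
import math

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, x):
    return [sum(row[j] * x[j] for j in range(len(x))) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def givens(dim, i, j, theta):
    """Orthogonal matrix that rotates the (i, j) plane by theta."""
    g = [[1.0 if r == c else 0.0 for c in range(dim)] for r in range(dim)]
    g[i][i] = g[j][j] = math.cos(theta)
    g[i][j] = -math.sin(theta)
    g[j][i] = math.sin(theta)
    return g

def svd_linear_map(dim, u_angle, v_angle, singular_values):
    """Hypothetical SVD-style parameterization A = U diag(s) V^T.

    Assumption: U and V are each a single Givens rotation, which keeps them
    orthogonal by construction. A real model would learn one such map per head.
    """
    u = givens(dim, 0, 2, u_angle)   # mixes coordinates across rotary planes
    v = givens(dim, 1, 3, v_angle)
    s = [[singular_values[r] if r == c else 0.0 for c in range(dim)]
         for r in range(dim)]
    return matmul(matmul(u, s), transpose(v))

def rope(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate each pair (i, i+1) by pos * base**(-i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def score(q, k, m, n, a):
    """Attention logit between a query at position m and a key at position n,
    with the linear map `a` applied to both before the rotary mapping."""
    qm = rope(matvec(a, q), m)
    kn = rope(matvec(a, k), n)
    return sum(x * y for x, y in zip(qm, kn))

a = svd_linear_map(4, 0.3, -0.7, [1.5, 0.8, 1.2, 0.5])
q = [0.2, -1.0, 0.5, 0.3]
k = [1.0, 0.4, -0.6, 0.9]
# Same offset (m - n = 2) at different absolute positions.
print(math.isclose(score(q, k, 3, 1, a), score(q, k, 10, 8, a), abs_tol=1e-9))
```

The check works because the score equals q^T A^T R(n - m) A k, where R is the block-diagonal rotation. The map A changes what each rotary plane sees, but absolute positions still cancel.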

Related papers

Computing a Characteristic Orientation for Rotation-Independent Image Analysis [0.0]
General Intensity Direction (GID) is a preprocessing method that improves rotation robustness without modifying the network architecture. It transforms the image while preserving spatial structure, making it compatible with convolutional networks. Experimental evaluation on the rotated MNIST dataset shows that the proposed method achieves higher accuracy than state-of-the-art rotation-invariant architectures.
arXiv Detail & Related papers (2026-02-24T14:08:12Z)
Untwisting RoPE: Frequency Control for Shared Attention in DiTs [84.14005261938284]
Positional encodings are essential to transformer-based generative models. We show that Rotary Positional Embeddings (RoPE) naturally decompose into frequency components with distinct positional sensitivities. We introduce a method for selectively modulating RoPE frequency bands so that attention reflects semantic similarity rather than strict positional alignment.
arXiv Detail & Related papers (2026-02-04T20:01:59Z)
Selective Rotary Position Embedding [84.22998043041198]
We introduce Selective RoPE, an input-dependent rotary embedding mechanism. We show that softmax attention already performs a hidden form of these rotations on query-key pairs. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling.
arXiv Detail & Related papers (2025-11-21T16:50:00Z)
Rotation Equivariant Arbitrary-scale Image Super-Resolution [62.41329042683779]
Arbitrary-scale image super-resolution (ASISR) aims to achieve arbitrary-scale high-resolution recovery from a low-resolution input image. In this study, we construct a rotation-equivariant ASISR method.
arXiv Detail & Related papers (2025-08-07T08:51:03Z)
Context-aware Rotary Position Embedding [0.0]
Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. We propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths.
arXiv Detail & Related papers (2025-07-30T20:32:19Z)
cIDIR: Conditioned Implicit Neural Representation for Regularized Deformable Image Registration [0.7022492404644499]
We propose cIDIR, a novel deformable image registration framework based on Implicit Neural Representations (INRs). cIDIR is trained over a prior distribution of regularization hyperparameters, then optimized over them by using the segmentation masks as an observation. It achieves high accuracy and robustness across the dataset.
arXiv Detail & Related papers (2025-07-17T09:48:53Z)
ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices [25.99231204405503]
We propose ComRoPE, which generalizes Rotary Position Embedding (RoPE) by defining it in terms of trainable commuting angle matrices. We present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation. Our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research.
arXiv Detail & Related papers (2025-06-04T09:10:02Z)
Feature Alignment with Equivariant Convolutions for Burst Image Super-Resolution [52.55429225242423]
We propose a novel framework for Burst Image Super-Resolution (BISR), featuring an equivariant convolution-based alignment. This enables the alignment transformation to be learned via explicit supervision in the image domain and easily applied in the feature domain. Experiments on BISR benchmarks show the superior performance of our approach in both quantitative metrics and visual quality.
arXiv Detail & Related papers (2025-03-11T11:13:10Z)
Hierarchical Semantic Regularization of Latent Spaces in StyleGANs [53.98170188547775]
We propose a Hierarchical Semantic Regularizer (HSR) which aligns the hierarchical representations learnt by the generator to corresponding powerful features learnt by pretrained networks on large amounts of data. HSR is shown to not only improve generator representations but also the linearity and smoothness of the latent style spaces, leading to the generation of more natural-looking style-edited images.
arXiv Detail & Related papers (2022-08-07T16:23:33Z)
DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency. The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation on the optimal number of tokens one position should focus on. Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.