Related papers: DoPE: Denoising Rotary Position Embedding

DoPE: Denoising Rotary Position Embedding

URL: http://arxiv.org/abs/2511.09146v1
Date: Thu, 13 Nov 2025 01:35:46 GMT
Title: DoPE: Denoising Rotary Position Embedding
Authors: Jing Xiong, Liyang Fan, Hui Shen, Zunhai Su, Min Yang, Lingpeng Kong, Ngai Wong,
Abstract summary: Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length.<n>We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional extrapolation page (DoPE)<n>DoPE is a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map.
Score: 60.779039511252584
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Our project page is Project: https://The-physical-picture-of-LLMs.github.io

Related papers

Noise-Robust Tiny Object Localization with Flows [63.60972031108944]
We propose a noise-robust localization framework leveraging normalizing flows for flexible error modeling and uncertainty-guided optimization.<n>Our method captures complex, non-Gaussian prediction distributions through flow-based error modeling, enabling robust learning under noisy supervision.<n>An uncertainty-aware gradient modulation mechanism further suppresses learning from high-uncertainty, noise-prone samples, mitigating overfitting while stabilizing training.
arXiv Detail & Related papers (2026-01-02T09:16:55Z)
Do traveling waves make good positional encodings? [44.55744608160896]
We propose RollPE, a novel positional encoding mechanism based on traveling waves.<n>We show it significantly outperforms traditional absolute positional embeddings.<n>We derive a mathematical equivalence of RollPE to a particular configuration of RoPE.
arXiv Detail & Related papers (2025-11-11T14:32:45Z)
Context-aware Rotary Position Embedding [0.0]
Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency.<n>We propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings.<n>CaroPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths.
arXiv Detail & Related papers (2025-07-30T20:32:19Z)
Robust and Noise-resilient Long-Term Prediction of Spatiotemporal Data Using Variational Mode Graph Neural Networks with 3D Attention [11.356542363919058]
This paper focuses on improving the robustness of long-term prediction using aaltemporal variation mode graphal network (VMGCN)<n>The deep learning network for this task relies on historical data inputs, yet real-time data can be corrupted by sensor noise.<n>We model this noise as independent and identically distributed (i.i.d.) Gaussian noise and incorporate it into the LargeST traffic volume dataset.
arXiv Detail & Related papers (2025-04-09T07:49:45Z)
Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.<n>We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
RhythmFormer: Extracting Patterned rPPG Signals based on Periodic Sparse Attention [18.412642801957197]
RRhythm is a non-contact method for detecting physiological signals based on physiological videos.<n>This paper proposes a periodic attention mechanism based on temporal attention sparsity induced by periodicity.<n>It achieves state-of-the-art performance in both intra-dataset and cross-dataset evaluations.
arXiv Detail & Related papers (2024-02-20T07:56:02Z)
Tolerating Annotation Displacement in Dense Object Counting via Point Annotation Probability Map [25.203803417049528]
Counting objects in crowded scenes remains a challenge to computer vision. We present a learning target point annotation probability map (PAPM) We also propose an adaptively learned PAPM method (AL-PAPM)
arXiv Detail & Related papers (2023-07-29T04:46:21Z)
Attention Map Guided Transformer Pruning for Edge Device [98.42178656762114]
Vision transformer (ViT) has achieved promising success in both holistic and occluded person re-identification (Re-ID) tasks. We propose a novel attention map guided (AMG) transformer pruning method, which removes both redundant tokens and heads. Comprehensive experiments on Occluded DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals.
arXiv Detail & Related papers (2023-04-04T01:51:53Z)
DINF: Dynamic Instance Noise Filter for Occluded Pedestrian Detection [0.0]
RCNN-based pedestrian detectors use rectangle regions to extract instance features. The number of severely overlapping objects and the number of slightly overlapping objects are unbalanced. An iterable dynamic instance noise filter (DINF) is proposed for the RCNN-based pedestrian detectors to improve the signal-noise ratio of the instance feature.
arXiv Detail & Related papers (2023-01-13T14:12:36Z)
Rethinking Spatial Invariance of Convolutional Networks for Object Counting [119.83017534355842]
We try to use locally connected Gaussian kernels to replace the original convolution filter to estimate the spatial position in the density map. Inspired by previous work, we propose a low-rank approximation accompanied with translation invariance to favorably implement the approximation of massive Gaussian convolution. Our methods significantly outperform other state-of-the-art methods and achieve promising learning of the spatial position of objects.
arXiv Detail & Related papers (2022-06-10T17:51:25Z)
Augmented Parallel-Pyramid Net for Attention Guided Pose-Estimation [90.28365183660438]
This paper proposes an augmented parallel-pyramid net with attention partial module and differentiable auto-data augmentation. We define a new pose search space where the sequences of data augmentations are formulated as a trainable and operational CNN component. Notably, our method achieves the top-1 accuracy on the challenging COCO keypoint benchmark and the state-of-the-art results on the MPII datasets.
arXiv Detail & Related papers (2020-03-17T03:52:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.