CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs
- URL: http://arxiv.org/abs/2602.05258v1
- Date: Thu, 05 Feb 2026 03:31:14 GMT
- Title: CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs
- Authors: Haoran Li, Sucheng Ren, Alan Yuille, Feng Wang
- Abstract summary: Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely soft clipping low-frequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping.
- Score: 18.897130541385646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out-of-distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) semantic modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping low-frequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state-of-the-art for length generalization. Our code, data, and models are available at https://github.com/hrlics/CoPE.
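The core intervention described in the abstract, soft-clipping the low-frequency RoPE components, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the softplus-based clip, the `min_freq` floor (here the frequency whose wavelength equals a 4096-token training window), and the `width` parameter are all choices made for demonstration; the exact functional form used by CoPE may differ.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Standard RoPE inverse frequencies: theta_i = base^(-2i/dim)."""
    return base ** (-np.arange(0, dim, 2) / dim)

def soft_clip(freqs, min_freq, width=0.1):
    """Smoothly floor frequencies at min_freq (a softplus-based smooth max).

    Approximates max(freqs, min_freq) without the sharp corner of a hard
    clip, which the paper associates with spectral leakage. `width` sets
    the transition scale as a fraction of min_freq; smaller width
    approaches a hard clip.
    """
    k = 1.0 / (width * min_freq)
    # logaddexp(0, x) is a numerically stable softplus log(1 + e^x)
    return min_freq + np.logaddexp(0.0, k * (freqs - min_freq)) / k

# Example: floor the frequencies whose wavelength exceeds a hypothetical
# 4096-token training window, i.e. freq < 2*pi/4096.
freqs = rope_frequencies(64)
clipped = soft_clip(freqs, min_freq=2 * np.pi / 4096)
```

High-frequency components pass through essentially unchanged, while components below the floor are smoothly lifted to `min_freq`, removing the untrained low-frequency band without a discontinuity.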
Related papers
- MrRoPE: Mixed-radix Rotary Position Embedding [15.874568186540076]
MrRoPE (Mixed-radix RoPE) is a general encoding formulation based on a radix system conversion perspective. We introduce two training-free extensions, MrRoPE-Uni and MrRoPE-Pro, which leverage uniform and progressive radix conversion strategies. MrRoPE-Pro sustains over 85% recall in the 128K-context Needle-in-a-Haystack test and achieves more than double YaRN's accuracy.
arXiv Detail & Related papers (2026-01-28T05:09:54Z) - Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs [72.8830548005884]
Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models. Standard implementations utilize only the real component of the complex-valued dot product for attention score calculation. We propose an extension that re-incorporates this imaginary component.
arXiv Detail & Related papers (2025-12-08T12:59:54Z) - DoPE: Denoising Rotary Position Embedding [60.779039511252584]
Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Rotary Position Embedding (DoPE). DoPE is a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map.
arXiv Detail & Related papers (2025-11-12T09:32:35Z) - A Circular Argument : Does RoPE need to be Equivariant for Vision? [45.33536249657655]
We mathematically show RoPE to be one of the most general solutions for equivariant positional embedding in one-dimensional data. We propose Spherical RoPE, a method analogous to Mixed RoPE, but one that assumes non-commutative generators.
arXiv Detail & Related papers (2025-11-11T15:47:54Z) - Positional Encoding via Token-Aware Phase Attention [45.855203550592734]
We show that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long contexts. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism.
arXiv Detail & Related papers (2025-09-16T03:53:32Z) - Context-aware Rotary Position Embedding [0.0]
Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. We propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths.
arXiv Detail & Related papers (2025-07-30T20:32:19Z) - LongRoPE2: Near-Lossless LLM Context Window Scaling [46.936900701411965]
LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; and (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences.
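The rescaling idea behind this line of work can be illustrated with its simplest uniform variant. Note this is a simplification for illustration: LongRoPE2's actual algorithm searches a separate rescaling factor per frequency band via evolutionary search, whereas the sketch below applies one uniform factor (position-interpolation style).

```python
import numpy as np

def rope_inv_freq(dim, base=10000.0):
    """Standard RoPE inverse frequencies: theta_i = base^(-2i/dim)."""
    return base ** (-np.arange(0, dim, 2) / dim)

def rescale_uniform(inv_freq, scale):
    """Uniform position-interpolation rescaling.

    Dividing every rotation angle by `scale` maps positions up to
    scale * train_len onto the angle range seen during training.
    Per-dimension methods like LongRoPE2 instead assign each frequency
    band its own factor; a single factor is used here for clarity.
    """
    return inv_freq / scale

# Hypothetical example: extending a 4k-trained model to 16k -> scale = 4.
inv_freq = rope_inv_freq(128)
extended = rescale_uniform(inv_freq, scale=4.0)
```

With this rescaling, position 16384 in the extended model produces the same rotation angles that position 4096 produced during training, so no angle is out of distribution.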
arXiv Detail & Related papers (2025-02-27T13:41:07Z) - VideoRoPE: What Makes for Good Video Rotary Position Embedding? [109.88966080843608]
VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing.
arXiv Detail & Related papers (2025-02-07T18:56:04Z) - HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation [19.42279057349193]
Positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion. We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information.
arXiv Detail & Related papers (2024-10-28T17:01:52Z) - Round and Round We Go! What makes Rotary Positional Encodings useful? [15.543752938828831]
We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies. We propose a modification of RoPE that fixes some highlighted issues and improves performance.
arXiv Detail & Related papers (2024-10-08T17:07:01Z) - Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose Scaling Laws of RoPE-based Extrapolation to describe the relationship between the extrapolation performance and the base value.
We achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
arXiv Detail & Related papers (2023-10-08T15:50:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.