Related papers: Scaling Laws of RoPE-based Extrapolation

Related papers

CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs [18.897130541385646]
Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely soft clipping the low-frequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping.
arXiv Detail & Related papers (2026-02-05T03:31:14Z)
Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs [72.8830548005884]
Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models. Standard implementations utilize only the real component of the complex-valued dot product for attention score calculation. We propose an extension that re-incorporates this imaginary component.
arXiv Detail & Related papers (2025-12-08T12:59:54Z)
Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling [3.7391437252721698]
We show that combining RoPE position interpolation (PI) with PTQ degrades accuracy due to effects including long-context aliasing, dynamic-range dilation, and anisotropy from axis-aligned quantizers vs. rotated RoPE pairs. We propose Q-ROAR (Quantization, RoPE-interpolation, and Outlier Aware Rescaling), a weight-only, interpolation-aware stabilization of PI for quantized LLMs.
arXiv Detail & Related papers (2025-09-26T01:23:32Z)
Q-ROAR: Outlier-Aware Rescaling for RoPE Position Interpolation in Quantized Long-Context LLMs [0.9510848451801044]
We show that combining PI with PTQ degrades accuracy due to coupled effects (long-context aliasing, dynamic-range dilation, axis-grid anisotropy, and shifting) that induce position-dependent logit noise. We propose Q-ROAR, a RoPE-aware, weight-only stabilization that groups RoPE dimensions into a few frequency bands and performs a small search over per-band scales for W_Q, W_K, with an optional symmetric variant to preserve logit scale.
arXiv Detail & Related papers (2025-09-17T19:50:16Z)
Positional Encoding via Token-Aware Phase Attention [45.855203550592734]
We show that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long context. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism.
arXiv Detail & Related papers (2025-09-16T03:53:32Z)
Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation [60.22622442950905]
Large Language Models (LLMs) often struggle to process and generate coherent context when the number of input tokens exceeds the pre-trained length. Recent advancements in long-context extension have significantly expanded the context window of LLMs. We propose Dimension-Wise Positional Embeddings Manipulation (DPE) to extrapolate the context window of LLMs.
arXiv Detail & Related papers (2025-04-26T08:46:10Z)
LongRoPE2: Near-Lossless LLM Context Window Scaling [46.936900701411965]
LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; and (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences.
arXiv Detail & Related papers (2025-02-27T13:41:07Z)
Rope to Nope and Back Again: A New Hybrid Attention Strategy [18.13605820945755]
Long-context large language models (LLMs) have achieved remarkable advancements, driven by techniques like Rotary Position Embedding (RoPE). This paper presents a comprehensive analysis of various attention mechanisms, including RoPE, No Positional Embedding (NoPE), and Query-Key Normalization (QK-Norm). We propose a novel architecture based on a hybrid attention mechanism that surpasses conventional RoPE-based transformer models in long-context tasks and achieves competitive performance on benchmarks requiring shorter context lengths.
arXiv Detail & Related papers (2025-01-30T23:05:57Z)
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training [51.23520027773028]
Extending context window sizes allows large language models to process longer sequences and handle more complex tasks. We observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding. We develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16.
arXiv Detail & Related papers (2024-11-20T17:22:31Z)
What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. Perplexity (PPL) has proven unreliable for assessing long-context capabilities. We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
Round and Round We Go! What makes Rotary Positional Encodings useful? [15.543752938828831]
We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies. We propose a modification of RoPE that fixes some highlighted issues and improves performance.
arXiv Detail & Related papers (2024-10-08T17:07:01Z)
Extending Context Window of Large Language Models from a Distributional Perspective [29.313701168816507]
We propose to optimize the context window extending task from the view of rotary angle distribution. We present a novel extension strategy that minimizes the disturbance between rotary angle distributions. Our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods.
arXiv Detail & Related papers (2024-10-02T12:40:11Z)
Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective [35.947737679664016]
This paper offers a straightforward yet in-depth understanding of RoPE extensions from an attention perspective. Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
arXiv Detail & Related papers (2024-06-19T07:23:33Z)
Base of RoPE Bounds Context Length [37.11078116104313]
Rotary position embedding (RoPE) is a technique that encodes the position information with a rotation matrix. In this paper, we find that LLMs may obtain a superficial long-context ability based on the OOD theory. Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long context training.
arXiv Detail & Related papers (2024-05-23T14:03:31Z)
Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs). With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z)
Resonance RoPE: Improving Context Length Generalization of Large Language Models [37.749813693281254]
This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE). We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios. We present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios.
arXiv Detail & Related papers (2024-02-29T19:02:03Z)
CLEX: Continuous Length Extrapolation for Large Language Models [68.43814043853347]
We propose Continuous Length EXtrapolation (CLEX) for Large Language Models (LLMs). CLEX extends the context window to over 4x or almost 8x training length, with no deterioration in performance. Our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
arXiv Detail & Related papers (2023-10-25T08:13:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.