Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling
- URL: http://arxiv.org/abs/2510.00028v1
- Date: Fri, 26 Sep 2025 01:23:32 GMT
- Title: Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling
- Authors: Ye Qiao, Haocheng Xu, Xiaofan Zhang, Sitao Huang
- Abstract summary: We show that combining RoPE position interpolation (PI) with PTQ degrades accuracy due to coupled effects including long-context aliasing, dynamic-range dilation, anisotropy from axis-aligned quantizers vs. rotated RoPE pairs, and outlier shifting that produces position-dependent logit noise. We propose Q-ROAR (Quantization, RoPE-interpolation, and Outlier Aware Rescaling), a weight-only, interpolation-aware stabilization of PI for quantized LLMs.
- Score: 3.7391437252721698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extending the context window support of large language models (LLMs) is crucial for tasks with long-distance dependencies. RoPE-based interpolation and extrapolation methods, such as linear scaling and frequency-aware schemes, enable longer input length support without retraining, while post-training quantization (PTQ) makes deployment practical. However, we show that combining RoPE position interpolation (PI) with PTQ degrades accuracy due to coupled effects including long-context aliasing, dynamic-range dilation, anisotropy from axis-aligned quantizers vs. rotated RoPE pairs, and outlier shifting that produces position-dependent logit noise. We provide, to the best of our knowledge, the first systematic analysis of the PI+PTQ approach and introduce two practical diagnostics: interpolation pressure (per-band sensitivity to phase scaling) and tail-inflation ratios (outlier shift from short to long contexts). Following the analysis results, we propose Q-ROAR (Quantization, RoPE-interpolation, and Outlier Aware Rescaling), a weight-only, interpolation-aware stabilization of PI for quantized LLMs. Q-ROAR groups RoPE dimensions into a small number of frequency bands and performs a lightweight search over per-band scales for Key and Query weights (with an optional symmetric variant to preserve logit scale). The search is guided by our diagnostics and uses a tiny long-context development dataset, requiring no fine-tuning of the model, no architecture or kernel changes, and no additional deployment overhead. Empirically, Q-ROAR reduces the model's perplexity on long-context workloads by more than 14%, while preserving short-context performance, inference throughput, and compatibility with existing LLM system stacks.
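The per-band rescaling described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the standard RoPE layout (head dimension d split into d/2 frequency pairs), groups the pairs into a few contiguous bands, and applies one scale per band to the rows of W_Q and W_K. All names (`n_bands`, `scales`, `rescale_qk`) are illustrative. Because each scale is shared by both channels of a RoPE pair, the diagonal scaling commutes with the 2x2 RoPE rotations, and the symmetric variant (s on W_Q, 1/s on W_K) leaves the Q-K logits unchanged while reshaping the weight ranges the quantizer sees.

```python
import numpy as np

def band_ids(d, n_bands=4):
    """Group the d/2 RoPE frequency pairs into n_bands contiguous bands."""
    per_band = int(np.ceil(d / 2 / n_bands))
    return np.repeat(np.arange(n_bands), per_band)[: d // 2]

def rescale_qk(W_q, W_k, scales, symmetric=True):
    """Apply one scale per frequency band to the rows of W_q / W_k.

    With symmetric=True, W_q is scaled by s and W_k by 1/s, so the
    Q.K logits are preserved exactly while the stored weight ranges
    seen by the quantizer change.
    """
    d = W_q.shape[0]
    ids = band_ids(d, len(scales))
    s = np.repeat(scales[ids], 2)  # one scale per pair -> one per channel
    if symmetric:
        return W_q * s[:, None], W_k / s[:, None]
    return W_q * s[:, None], W_k * s[:, None]

# Toy check: symmetric rescaling leaves Q.K^T logits unchanged.
rng = np.random.default_rng(0)
d, d_model = 8, 16
W_q = rng.normal(size=(d, d_model))
W_k = rng.normal(size=(d, d_model))
x = rng.normal(size=(d_model,))
scales = np.array([0.5, 1.0, 2.0, 4.0])

W_q2, W_k2 = rescale_qk(W_q, W_k, scales)
logits_before = (W_q @ x) @ (W_k @ x)
logits_after = (W_q2 @ x) @ (W_k2 @ x)
print(np.allclose(logits_before, logits_after))  # True
```

In the actual method, the per-band scales would be chosen by the diagnostics-guided search on a small long-context development set; here they are fixed for illustration.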
Related papers
- Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs [72.8830548005884]
Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models. Standard implementations utilize only the real component of the complex-valued dot product for attention score calculation. We propose an extension that re-incorporates this imaginary component.
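The "real component" claim in this summary is a small verifiable identity: viewing each RoPE channel pair as a complex number rotated by a position-dependent phase, the real-vector dot product that standard attention computes equals the real part of the complex inner product, and the imaginary part is discarded. A minimal numeric check (all values illustrative):

```python
import numpy as np

def rope_rotate(x, pos, theta):
    """Rotate the 2D pair x = (a, b) by angle pos*theta (standard RoPE)."""
    a, b = x
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([a * c - b * s, a * s + b * c])

theta, m, n = 0.3, 5, 2
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# Standard RoPE attention contribution for this pair: rotate, then dot.
real_score = rope_rotate(q, m, theta) @ rope_rotate(k, n, theta)

# Complex view: the same pairs as complex numbers with phase rotations.
qc = complex(q[0], q[1]) * np.exp(1j * m * theta)
kc = complex(k[0], k[1]) * np.exp(1j * n * theta)
complex_score = qc * np.conj(kc)  # real part = RoPE score; imag part is dropped

print(np.isclose(real_score, complex_score.real))  # True
```

The cited paper's contribution is to make use of that discarded imaginary part; the sketch above only demonstrates the baseline identity it starts from.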
arXiv Detail & Related papers (2025-12-08T12:59:54Z) - ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference [13.283581083797484]
Post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. The presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation. We propose Pairwise Rotation Quantization (ParoQuant) to suppress outliers without introducing significant overhead during inference. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead.
arXiv Detail & Related papers (2025-11-13T18:59:24Z) - DoPE: Denoising Rotary Position Embedding [60.779039511252584]
Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map and propose Denoising Positional Embedding (DoPE). DoPE is a training-free method based on truncated matrix entropy that detects outlier frequency bands in the feature map.
arXiv Detail & Related papers (2025-11-12T09:32:35Z) - Q-ROAR: Outlier-Aware Rescaling for RoPE Position Interpolation in Quantized Long-Context LLMs [0.9510848451801044]
We show that combining PI with PTQ degrades accuracy due to coupled effects: long-context aliasing, dynamic-range dilation, axis-grid anisotropy, and outlier shifting that induce position-dependent logit noise. We propose Q-ROAR, a RoPE-aware, weight-only stabilization that groups RoPE dimensions into a few frequency bands and performs a small search over per-band scales for W_Q, W_K, with an optional symmetric variant to preserve logit scale.
arXiv Detail & Related papers (2025-09-17T19:50:16Z) - HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models [19.3827288035483]
We propose Hyperbolic Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to implement Lorentz rotations on token representations. Tests show HoPE consistently exceeds existing positional encoding methods.
arXiv Detail & Related papers (2025-09-05T16:20:48Z) - Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings [29.421443764865003]
We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings (PoPE), that eliminates the what-where confound.
arXiv Detail & Related papers (2025-09-05T14:22:27Z) - QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution [53.13952833016505]
We propose a low-bit quantization model for real-world video super-resolution (VSR). We use a calibration dataset to measure both spatial and temporal complexity for each layer. We refine the FP and low-bit branches to achieve simultaneous optimization.
arXiv Detail & Related papers (2025-08-06T14:35:59Z) - LaMPE: Length-aware Multi-grained Positional Encoding for Adaptive Long-context Scaling Without Training [45.74983991122073]
Large language models (LLMs) experience significant performance degradation when the input exceeds the pretraining context window. Recent studies mitigate this problem by remapping out-of-distribution (OOD) positions into the in-distribution range with fixed mapping strategies. We propose Length-aware Multi-grained Positional Encoding (LaMPE), a training-free method that fully utilizes the model's effective context window.
arXiv Detail & Related papers (2025-08-04T11:22:13Z) - Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose Scaling Laws of RoPE-based Extrapolation to describe the relationship between the extrapolation performance and the RoPE base value.
We achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
arXiv Detail & Related papers (2023-10-08T15:50:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.