Towards Infinite Length Extrapolation: A Unified Approach
- URL: http://arxiv.org/abs/2601.06113v1
- Date: Sat, 03 Jan 2026 14:10:23 GMT
- Title: Towards Infinite Length Extrapolation: A Unified Approach
- Authors: Nitin Vetcha
- Abstract summary: Large language models (LLMs) have revolutionized natural language processing, but their ability to process long sequences is fundamentally limited by the context window size used during training. We develop a unified framework that reinterprets positional encoding methods as a decomposition of the attention score into a multiplicative transformation and an additive bias. Our theoretical analysis establishes conditions for infinite-context extrapolation, ensuring that the softmax normalization remains well-defined over unbounded sequences while preserving long-distance correlations, entropy boundedness, and gradient positional sensitivity.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have revolutionized natural language processing, but their ability to process long sequences is fundamentally limited by the context window size used during training. Existing length extrapolation methods often suffer from performance degradation or computational inefficiencies. We therefore develop a unified framework that reinterprets positional encoding methods as a decomposition of the attention score into a multiplicative transformation and an additive bias. This perspective not only subsumes popular approaches such as relative position embeddings and attention-bias moderated approaches but also exposes their inherent limitations in handling long-range dependencies. To address these shortcomings, motivated by our framework, we introduce Adaptive Positional Encoding (APE), which leverages adaptive frequency modulation and an intricately designed decay bias that incorporates linear, logarithmic, and square-root terms. Our theoretical analysis establishes conditions for infinite-context extrapolation, ensuring that the softmax normalization remains well-defined over unbounded sequences while preserving long-distance correlations, entropy boundedness, and gradient positional sensitivity. We substantiate our claims with an experimental case study on the TinyStories dataset as well as a new synthetic dataset, Long Tiny Stories, featuring stories of up to 32,000 words. Relevant code, dataset, and model weights are available at https://anonymous.4open.science/r/Check-2DAD/.
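To make the decomposition concrete, the following minimal NumPy sketch separates an attention score into the two components the abstract names: a rotary-style multiplicative transformation of queries and keys (standing in for APE's adaptive frequency modulation, whose exact rule the abstract does not specify) and an additive decay bias combining linear, logarithmic, and square-root terms. The coefficients alpha, beta, and gamma are illustrative placeholders, not the paper's notation.

```python
import numpy as np

def rotary_transform(x, pos, freqs):
    """Multiplicative part: rotate feature pairs of x by position-dependent
    angles. In APE the frequencies would be adaptively modulated; a plain
    RoPE-style rotation is used here as a stand-in."""
    half = x.shape[-1] // 2
    angles = pos * freqs                          # freqs has length half
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def decay_bias(dist, alpha, beta, gamma):
    """Additive part: decay bias with linear, logarithmic, and square-root
    terms, as described in the abstract (coefficient names are assumptions)."""
    return -(alpha * dist + beta * np.log1p(dist) + gamma * np.sqrt(dist))

def attention_scores(q, k, freqs, alpha=0.01, beta=0.1, gamma=0.05):
    """score(i, j) = <R_i q_i, R_j k_j> / sqrt(d) + b(i - j), causally masked."""
    n, d = q.shape
    qr = np.stack([rotary_transform(q[i], i, freqs) for i in range(n)])
    kr = np.stack([rotary_transform(k[j], j, freqs) for j in range(n)])
    scores = qr @ kr.T / np.sqrt(d)
    dist = np.arange(n)[:, None] - np.arange(n)[None, :]   # i - j
    scores = scores + decay_bias(np.maximum(dist, 0), alpha, beta, gamma)
    return np.where(dist >= 0, scores, -np.inf)            # causal mask
```

Informally, because the bias grows without bound in the distance, softmax weights of far-away tokens decay fast enough for the normalization to stay well-defined over unbounded sequences, which is the mechanism the abstract's theoretical conditions formalize.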
Related papers
- TS-Haystack: A Multi-Scale Retrieval Benchmark for Time Series Language Models [4.387988928531881]
Time Series Language Models (TSLMs) are emerging as unified models for reasoning over continuous signals in natural language. Existing models are typically trained and evaluated on short sequences, while real-world time-series sensor streams can span millions of datapoints. We introduce TS-Haystack, a long-context temporal retrieval benchmark comprising ten task types across four categories.
arXiv Detail & Related papers (2026-02-15T15:50:02Z)
- Gated Differentiable Working Memory for Long-Context Language Modeling [80.27483324685434]
We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4× fewer gradient steps than uniform baselines.
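A minimal sketch of the gating idea from this summary, assuming the write controller is a single sigmoid layer over the current memory and the candidate update (the paper's actual controller design is not given here):

```python
import numpy as np

def gated_write(memory, candidate, w, b):
    """Gate the consolidation step: a learned controller decides how much of
    the candidate state is written into working memory. The single-layer
    sigmoid controller is an illustrative assumption."""
    gate = 1.0 / (1.0 + np.exp(-(w @ np.concatenate([memory, candidate]) + b)))
    return (1.0 - gate) * memory + gate * candidate
```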
arXiv Detail & Related papers (2026-01-19T10:00:33Z)
- Dimension-free error estimate for diffusion model and optimal scheduling [22.20348860913421]
Diffusion generative models have emerged as powerful tools for producing synthetic data from an empirically observed distribution. Previous analyses have quantified the error between the generated and the true data distributions in terms of Wasserstein distance or Kullback-Leibler divergence. In this work, we derive an explicit, dimension-free bound on the discrepancy between the generated and the true data distributions.
arXiv Detail & Related papers (2025-12-01T15:58:20Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating constraints on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
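A generic rejection-sampling sketch of the idea in this summary (draw from the raw model distribution and evaluate the expensive constraint only on tokens actually drawn), assuming a boolean `satisfies` oracle; the paper's adaptive weighting scheme is not reproduced:

```python
import numpy as np

def constrained_sample(probs, satisfies, max_tries=64):
    """Sample a token under a constraint without scoring the full vocabulary:
    check the constraint only on drawn tokens, and zero out rejected tokens
    before renormalizing and redrawing."""
    p = probs.astype(float).copy()
    for _ in range(max_tries):
        if p.sum() == 0.0:
            break
        tok = np.random.choice(len(p), p=p / p.sum())
        if satisfies(tok):
            return tok
        p[tok] = 0.0          # never redraw a rejected token
    raise RuntimeError("no valid token found within the try budget")
```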
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Context-aware Biases for Length Extrapolation [0.19116784879310025]
We propose an additive RPE, Context-Aware Biases for Length Extrapolation (CABLE). CABLE learns token-specific, context-aware biases for each attention head in transformers. Our method significantly enhances the performance of existing RPE methods tested on the FineWeb-Edu-10B and WikiText-103 datasets.
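A minimal sketch of a token-specific, context-aware additive bias for one attention head, assuming the bias is produced from each token's hidden state by a small MLP (an illustrative parameterization; CABLE's exact architecture is not given in this summary):

```python
import numpy as np

def context_aware_bias(hidden, w1, b1, w2, b2):
    """One scalar bias per token, computed from that token's hidden state."""
    return np.tanh(hidden @ w1 + b1) @ w2 + b2            # shape (n,)

def biased_attention_scores(q, k, hidden, w1, b1, w2, b2):
    """Add the learned, context-dependent bias of each key token to the
    attention logits, in place of a fixed relative-position bias."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return scores + context_aware_bias(hidden, w1, b1, w2, b2)[None, :]
```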
arXiv Detail & Related papers (2025-03-11T05:54:58Z)
- PICASO: Permutation-Invariant Context Composition with State Space Models [98.91198288025117]
State Space Models (SSMs) offer a promising solution by allowing a database of contexts to be mapped onto fixed-dimensional states. We propose a simple mathematical relation derived from SSM dynamics to compose multiple states into one that efficiently approximates the effect of concatenating raw context tokens. We evaluate our resulting method on WikiText and MSMARCO in both zero-shot and fine-tuned settings, and show that we can match the strongest performing baseline while enjoying on average a 5.4x speedup.
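The composition idea can be sketched for a time-invariant linear SSM, where concatenating two contexts gives s = A^{L2} s1 + s2, and averaging that rule over context orderings yields a permutation-invariant composite. This is only an illustration under a linear-SSM assumption, not PICASO's exact relation:

```python
import numpy as np
from itertools import permutations

def run_ssm(A, B, tokens):
    """Linear SSM recurrence x_t = A x_{t-1} + B u_t, from the zero state."""
    x = np.zeros(A.shape[0])
    for u in tokens:
        x = A @ x + B @ u
    return x

def compose_states(A, states, lengths):
    """Average, over all context orderings, the state obtained by chaining
    per-context final states; exact for concatenation in each single order."""
    comps = []
    for order in permutations(range(len(states))):
        s = np.zeros(A.shape[0])
        for i in order:
            s = np.linalg.matrix_power(A, lengths[i]) @ s + states[i]
        comps.append(s)
    return np.mean(comps, axis=0)
```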
arXiv Detail & Related papers (2025-02-24T19:48:00Z)
- Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding [89.52931576290976]
We present conTextualized equivariAnt Position Encoding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead.
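As a rough illustration of contextualizing positional embeddings with sequence content (the gating form below is an assumption; TAPE's equivariance-preserving update is not specified in this summary):

```python
import numpy as np

def contextualize_positions(pos_embed, hidden, w):
    """Modulate each position's encoding by a projection of that token's
    hidden state, so positional information is updated layer by layer."""
    gate = np.tanh(hidden @ w)            # (n, d) content-dependent modulation
    return pos_embed * (1.0 + gate)
```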
arXiv Detail & Related papers (2025-01-01T03:23:00Z)
- Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
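For reference, a minimal sketch of the standard two-point zeroth-order gradient estimate that gradient-free SGD methods build on; the paper's accelerated, smoothness-adapted variant is not reproduced here:

```python
import numpy as np

def zero_order_sgd(f, x, steps=1000, lr=0.01, mu=1e-4):
    """Gradient-free SGD: estimate the gradient along a random direction e via
    g = (f(x + mu e) - f(x - mu e)) / (2 mu) * e, then take a descent step."""
    for _ in range(steps):
        e = np.random.randn(x.size)
        e /= np.linalg.norm(e)
        g = (f(x + mu * e) - f(x - mu * e)) / (2 * mu) * e
        x = x - lr * g
    return x
```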
arXiv Detail & Related papers (2024-11-21T10:26:17Z)
- Tilt your Head: Activating the Hidden Spatial-Invariance of Classifiers [0.7704032792820767]
Deep neural networks are applied in more and more areas of everyday life.
They still lack essential abilities, such as robustly dealing with spatially transformed input signals.
We propose a novel technique to emulate such an inference process for neural nets.
arXiv Detail & Related papers (2024-05-06T09:47:29Z) - Efficient and Near-Optimal Smoothed Online Learning for Generalized
Linear Functions [28.30744223973527]
We give a computationally efficient algorithm that is the first to enjoy the statistically optimal log(T/σ) regret for realizable K-wise linear classification.
We develop a novel characterization of the geometry of the disagreement region induced by generalized linear classifiers.
arXiv Detail & Related papers (2022-05-25T21:31:36Z) - Robust Implicit Networks via Non-Euclidean Contractions [63.91638306025768]
Implicit neural networks show improved accuracy and significant reduction in memory consumption.
They can suffer from ill-posedness and convergence instability.
This paper provides a new framework to design well-posed and robust implicit neural networks.
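A minimal sketch of an implicit (fixed-point) layer and one classical sufficient condition for well-posedness, contraction of the update in the infinity norm; the paper develops sharper non-Euclidean contraction criteria:

```python
import numpy as np

def implicit_layer(W, U, b, x, tol=1e-6, max_iter=500):
    """Solve z = tanh(W z + U x + b) by fixed-point iteration. Since tanh is
    1-Lipschitz, ||W||_inf < 1 makes the map a contraction, guaranteeing a
    unique equilibrium and convergence of the iteration."""
    inf_norm = np.abs(W).sum(axis=1).max()
    if inf_norm >= 1.0:
        W = 0.95 * W / inf_norm           # rescale to enforce contraction
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.max(np.abs(z_next - z)) < tol:
            break
        z = z_next
    return z
```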
arXiv Detail & Related papers (2021-06-06T18:05:02Z)