Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis
- URL: http://arxiv.org/abs/2601.15300v1
- Date: Wed, 07 Jan 2026 07:56:31 GMT
- Title: Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis
- Authors: Weiwei Wang, Jiyong Min, Weijie Zou
- Abstract summary: Large Language Models (LLMs) exhibit performance degradation when processing contexts approaching certain critical thresholds. This intelligence degradation, defined as an over-30% drop in task performance, severely limits long-context applications. This work provides the first systematic characterization of intelligence degradation in open-source Qwen models.
- Score: 2.085792950847639
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large Language Models (LLMs) exhibit catastrophic performance degradation when processing contexts approaching certain critical thresholds, even when information remains relevant. This intelligence degradation, defined as an over-30% drop in task performance, severely limits long-context applications. The degradation shows a common pattern: models maintain strong performance up to a critical threshold, then collapse catastrophically. We term this shallow long-context adaptation: models adapt to short and medium contexts but fail beyond critical thresholds. This paper presents three contributions: (1) Natural Length Distribution Analysis: We use each sample's natural token length without truncation or padding, providing stronger causal evidence that degradation results from context length itself. (2) Critical Threshold Determination: Through experiments on a mixed dataset (1,000 samples covering 5%-95% of context length), we identify the critical threshold for Qwen2.5-7B at 40-50% of maximum context length, where F1 scores drop from 0.55-0.56 to 0.3 (45.5% degradation), using five-method cross-validation. (3) Unified Framework: We consolidate our findings under the shallow long-context adaptation framework, explaining degradation patterns and providing a foundation for mitigation strategies. This work provides the first systematic characterization of intelligence degradation in open-source Qwen models, offering practical guidance for deploying LLMs in long-context scenarios.
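The threshold-determination idea in the abstract can be sketched in a few lines: bucket samples by their natural token length as a fraction of the maximum context, average F1 per bucket, and report the lower edge of the first bucket whose mean F1 falls more than 30% below the short-context baseline. This is a minimal illustration only; `find_critical_threshold` is a hypothetical helper, not the authors' five-method cross-validation pipeline.

```python
# Minimal sketch of critical-threshold detection over a natural length
# distribution. Assumes per-sample (natural_token_length, f1) pairs;
# the function name and bucketing scheme are illustrative, not the
# paper's actual method.
from statistics import mean

def find_critical_threshold(samples, max_context, n_buckets=10, drop=0.30):
    """Return the lower edge (as a fraction of max_context) of the first
    length bucket whose mean F1 drops more than `drop` below the
    shortest-context baseline, or None if no collapse is observed."""
    buckets = [[] for _ in range(n_buckets)]
    for length, f1 in samples:
        idx = min(int(length / max_context * n_buckets), n_buckets - 1)
        buckets[idx].append(f1)
    means = [mean(b) if b else None for b in buckets]
    baseline = next(m for m in means if m is not None)  # shortest bucket
    for i, m in enumerate(means):
        if m is not None and m < baseline * (1 - drop):
            return i / n_buckets
    return None
```

On the paper's reported numbers (F1 around 0.55 below the threshold, 0.30 above), such a scan would flag the 40-50% bucket, matching the 45.5% relative degradation described above.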
Related papers
- The Limits of Long-Context Reasoning in Automated Bug Fixing [4.853967615615349]
Large language models (LLMs) can directly reason over entire contexts. Recent advances in LLMs have enabled strong performance on software engineering benchmarks. We systematically evaluate whether current LLMs can reliably perform long-context code and patch generation.
arXiv Detail & Related papers (2026-02-17T22:51:40Z) - Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning [53.58654277639939]
In-context exploration is the intrinsic ability to generate, verify, and refine hypotheses within a single continuous context. We propose Length-Incentivized Exploration, which explicitly encourages models to explore more. Our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
arXiv Detail & Related papers (2026-02-12T09:24:32Z) - ReEfBench: Quantifying the Reasoning Efficiency of LLMs [9.462320482705508]
We propose a novel neuro-symbolic framework for the non-intrusive, comprehensive, process-centric evaluation of reasoning. Our analysis reveals that extended token generation is not a prerequisite for deep reasoning.
arXiv Detail & Related papers (2026-01-07T03:33:07Z) - Towards Infinite Length Extrapolation: A Unified Approach [0.0]
Large language models (LLMs) have revolutionized natural language processing, but their ability to process long sequences is fundamentally limited by the context window size during training. We use a unified framework that reinterprets positional encoding methods as a decomposition of the attention score into a multiplicative transformation and an additive bias. Our theoretical analysis establishes conditions for infinite-context extrapolation, ensuring that the softmax remains well-defined over unbounded sequences while preserving long-distance correlations, entropy boundedness, and gradient positional sensitivity.
arXiv Detail & Related papers (2026-01-03T14:10:23Z) - Latent Sculpting for Zero-Shot Generalization: A Manifold Learning Approach to Out-of-Distribution Anomaly Detection [2.8547732086436306]
A fundamental limitation of supervised deep learning is "Generalization Collapse". We propose Latent Sculpting, a hierarchical two-stage representation learning framework. We report an 88.89% detection rate on "Infiltration" scenarios.
arXiv Detail & Related papers (2025-12-19T11:37:02Z) - Context Length Alone Hurts LLM Performance Despite Perfect Retrieval [29.523005523787244]
Large language models (LLMs) often fail to scale their performance on long-context tasks in line with the context lengths they support. This paper asks whether such scaling is achievable and presents findings suggesting the answer may be negative.
arXiv Detail & Related papers (2025-10-06T21:17:13Z) - Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination [77.69093448529455]
We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers. We observe a lack of significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. We hypothesize that the multi-step reasoning required by our synthesis pipeline adds complexity that goes deeper than shallow memorization.
arXiv Detail & Related papers (2025-08-26T16:41:37Z) - A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces LongBioBench, a benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results. Our further analysis examines design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z) - Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search [76.54475437069395]
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information. We propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior.
arXiv Detail & Related papers (2025-02-03T18:43:36Z) - CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels [11.614599448394374]
We introduce CNNSum, a multi-scale long-context summarization benchmark based on Chinese novels. CNNSum features human-driven annotations across four subsets totaling 695 samples, with lengths ranging from 16k to 128k. We benchmark numerous LLMs and conduct detailed human assessments to summarize abnormal output types.
arXiv Detail & Related papers (2024-12-03T20:35:57Z) - SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [53.4441894198495]
Large language models (LLMs) now support extremely long context windows. The quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. We propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism.
arXiv Detail & Related papers (2024-06-17T11:05:15Z) - Test-time Batch Statistics Calibration for Covariate Shift [66.7044675981449]
We propose to adapt the deep models to the novel environment during inference.
We present a general formulation, α-BN, to calibrate the batch statistics.
We also present a novel loss function to form a unified test-time adaptation framework, Core.
arXiv Detail & Related papers (2021-10-06T08:45:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.