Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis
- URL: http://arxiv.org/abs/2601.15300v1
- Date: Wed, 07 Jan 2026 07:56:31 GMT
- Title: Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis
- Authors: Weiwei Wang, Jiyong Min, Weijie Zou
- Abstract summary: Large Language Models (LLMs) exhibit performance degradation when processing contexts approaching certain critical thresholds. This intelligence degradation, defined as an over-30% drop in task performance, severely limits long-context applications. This work provides the first systematic characterization of intelligence degradation in open-source Qwen models.
- Score: 2.085792950847639
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large Language Models (LLMs) exhibit catastrophic performance degradation when processing contexts approaching certain critical thresholds, even when information remains relevant. This intelligence degradation, defined as an over-30% drop in task performance, severely limits long-context applications. The degradation shows a common pattern: models maintain strong performance up to a critical threshold, then collapse catastrophically. We term this shallow long-context adaptation: models adapt to short and medium contexts but fail beyond critical thresholds. This paper presents three contributions: (1) Natural Length Distribution Analysis: We use each sample's natural token length without truncation or padding, providing stronger causal evidence that degradation results from context length itself. (2) Critical Threshold Determination: Through experiments on a mixed dataset (1,000 samples covering 5%-95% of context length), we identify the critical threshold for Qwen2.5-7B at 40-50% of maximum context length, where F1 scores drop from 0.55-0.56 to 0.3 (45.5% degradation), using five-method cross-validation. (3) Unified Framework: We consolidate our findings under the shallow long-context adaptation framework, explaining degradation patterns and providing a foundation for mitigation strategies. This work provides the first systematic characterization of intelligence degradation in open-source Qwen models, offering practical guidance for deploying LLMs in long-context scenarios.
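The threshold-determination idea in the abstract can be sketched in a few lines: bucket samples by their natural token length as a fraction of the maximum context, average F1 per bucket, and report the lower edge of the first bucket whose mean F1 falls more than 30% below the short-context baseline. This is a minimal illustration only; `find_critical_threshold` is a hypothetical helper, not the authors' five-method cross-validation pipeline.

```python
# Minimal sketch of critical-threshold detection over a natural length
# distribution. Assumes per-sample (natural_token_length, f1) pairs;
# the function name and bucketing scheme are illustrative, not the
# paper's actual method.
from statistics import mean

def find_critical_threshold(samples, max_context, n_buckets=10, drop=0.30):
    """Return the lower edge (as a fraction of max_context) of the first
    length bucket whose mean F1 drops more than `drop` below the
    shortest-context baseline, or None if no collapse is observed."""
    buckets = [[] for _ in range(n_buckets)]
    for length, f1 in samples:
        idx = min(int(length / max_context * n_buckets), n_buckets - 1)
        buckets[idx].append(f1)
    means = [mean(b) if b else None for b in buckets]
    baseline = next(m for m in means if m is not None)  # shortest bucket
    for i, m in enumerate(means):
        if m is not None and m < baseline * (1 - drop):
            return i / n_buckets
    return None
```

On the paper's reported numbers (F1 around 0.55 below the threshold, 0.30 above), such a scan would flag the 40-50% bucket, matching the 45.5% relative degradation described above.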
Related papers
- The Limits of Long-Context Reasoning in Automated Bug Fixing [4.853967615615349]
Large language models (LLMs) can directly reason over entire contexts. Recent advances in LLMs have enabled strong performance on software engineering benchmarks. We systematically evaluate whether current LLMs can reliably perform long-context code and patch generation.
arXiv Detail & Related papers (2026-02-17T22:51:40Z) - Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning [53.58654277639939]
In-context exploration is the intrinsic ability to generate, verify, and refine hypotheses within a single continuous context. We propose Length-Incentivized Exploration, which explicitly encourages models to explore more. Our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
arXiv Detail & Related papers (2026-02-12T09:24:32Z) - ReEfBench: Quantifying the Reasoning Efficiency of LLMs [9.462320482705508]
We propose a novel neuro-symbolic framework for the non-intrusive, comprehensive, process-centric evaluation of reasoning. Our analysis reveals that extended token generation is not a prerequisite for deep reasoning.
arXiv Detail & Related papers (2026-01-07T03:33:07Z) - Towards Infinite Length Extrapolation: A Unified Approach [0.0]
Large language models (LLMs) have revolutionized natural language processing, but their ability to process long sequences is fundamentally limited by the context window size during training. We use a unified framework that reinterprets positional encoding methods as a decomposition of the attention score into a multiplicative transformation and an additive bias. Our theoretical analysis establishes conditions for infinite-context extrapolation, ensuring that the softmax remains well-defined over unbounded sequences while preserving long-distance correlations, entropy boundedness, and gradient positional sensitivity.
arXiv Detail & Related papers (2026-01-03T14:10:23Z) - Latent Sculpting for Zero-Shot Generalization: A Manifold Learning Approach to Out-of-Distribution Anomaly Detection [2.8547732086436306]
A fundamental limitation of supervised deep learning is "Generalization Collapse". We propose Latent Sculpting, a hierarchical two-stage representation learning framework. We report an 88.89% detection rate on "Infiltration" scenarios.
arXiv Detail & Related papers (2025-12-19T11:37:02Z) - Context Length Alone Hurts LLM Performance Despite Perfect Retrieval [29.523005523787244]
Large language models (LLMs) often fail to scale their performance on long-context tasks in line with the context lengths they support. This paper asks whether such scaling is achievable and presents findings suggesting the answer may be negative.
arXiv Detail & Related papers (2025-10-06T21:17:13Z) - Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination [77.69093448529455]
We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers. We observe a lack of significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. We hypothesize that the multi-step reasoning required by our synthesis pipeline adds complexity that goes deeper than shallow memorization.
arXiv Detail & Related papers (2025-08-26T16:41:37Z) - A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces LongBioBench, a benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results. Our further analysis examines design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z) - Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search [76.54475437069395]
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information. We propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior.
arXiv Detail & Related papers (2025-02-03T18:43:36Z) - CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels [11.614599448394374]
We introduce CNNSum, a multi-scale long-context summarization benchmark based on Chinese novels. CNNSum features human-driven annotations across four subsets totaling 695 samples, with lengths ranging from 16k to 128k. We benchmark numerous LLMs and conduct detailed human assessments to summarize abnormal output types.
arXiv Detail & Related papers (2024-12-03T20:35:57Z) - SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [53.4441894198495]
Large language models (LLMs) now support extremely long context windows. The quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. We propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism.
arXiv Detail & Related papers (2024-06-17T11:05:15Z) - Test-time Batch Statistics Calibration for Covariate Shift [66.7044675981449]
We propose to adapt the deep models to the novel environment during inference.
We present a general formulation, α-BN, to calibrate the batch statistics.
We also present a novel loss function to form a unified test-time adaptation framework, Core.
arXiv Detail & Related papers (2021-10-06T08:45:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.