The Limits of Long-Context Reasoning in Automated Bug Fixing
- URL: http://arxiv.org/abs/2602.16069v1
- Date: Tue, 17 Feb 2026 22:51:40 GMT
- Title: The Limits of Long-Context Reasoning in Automated Bug Fixing
- Authors: Ravi Raju, Mengmeng Ji, Shubhangi Upasani, Bo Li, Urmish Thakker,
- Abstract summary: Large language models (LLMs) can directly reason over entire contexts. Recent advances in LLMs have enabled strong performance on software engineering benchmarks. We systematically evaluate whether current LLMs can reliably perform long-context code and patch generation.
- Score: 4.853967615615349
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rapidly increasing context lengths have led to the assumption that large language models (LLMs) can directly reason over entire codebases. Concurrently, recent advances in LLMs have enabled strong performance on software engineering benchmarks, particularly when paired with agentic workflows. In this work, we systematically evaluate whether current LLMs can reliably perform long-context code debugging and patch generation. Using SWE-bench Verified as a controlled experimental setting, we first evaluate state-of-the-art models within an agentic harness (mini-SWE-agent), where performance improves substantially: GPT-5-nano achieves up to a 31% resolve rate on 100 samples, and open-source models such as Deepseek-R1-0528 obtain competitive results. However, token-level analysis shows that successful agentic trajectories typically remain under 20k tokens, and that longer accumulated contexts correlate with lower success rates, indicating that agentic success primarily arises from task decomposition into short-context steps rather than effective long-context reasoning. To directly test long-context capability, we construct a data pipeline where we artificially inflate the context length of the input by placing the relevant files into the context (ensuring perfect retrieval recall); we then study single-shot patch generation under genuinely long contexts (64k-128k tokens). Despite this setup, performance degrades sharply: Qwen3-Coder-30B-A3B achieves only a 7% resolve rate at 64k context, while GPT-5-nano solves none of the tasks. Qualitative analysis reveals systematic failure modes, including hallucinated diffs, incorrect file targets, and malformed patch headers. Overall, our findings highlight a significant gap between nominal context length and usable context capacity in current LLMs, and suggest that existing agentic coding benchmarks do not meaningfully evaluate long-context reasoning.
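The context-inflation step described in the abstract (always include the bug-relevant files, then pad with other repository files up to a token budget) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the paper does not specify its tokenizer or file-ordering scheme, so token counts here are crudely approximated by whitespace splitting, and the function names are hypothetical.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate via whitespace splitting; a real pipeline
    would use the target model's tokenizer."""
    return len(text.split())


def build_inflated_context(gold_files, distractor_files, budget_tokens):
    """Build a long-context prompt for single-shot patch generation.

    gold_files: list of (path, contents) for the bug-relevant files;
        always included first, guaranteeing perfect retrieval recall.
    distractor_files: list of (path, contents) used only to inflate the
        context toward budget_tokens (e.g. 64k or 128k).
    Returns the concatenated prompt and its approximate token count.
    """
    parts, used = [], 0
    for path, body in gold_files:
        parts.append(f"### {path}\n{body}")
        used += approx_tokens(body)
    for path, body in distractor_files:
        cost = approx_tokens(body)
        if used + cost > budget_tokens:
            break  # stop padding once the budget would be exceeded
        parts.append(f"### {path}\n{body}")
        used += cost
    return "\n\n".join(parts), used
```

Because the gold files are inserted unconditionally, retrieval recall is 100% by construction; any failure at 64k-128k tokens therefore isolates the model's long-context reasoning rather than retrieval.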
Related papers
- SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents [32.69890220986935]
We propose SWE-Pruner, a self-adaptive context pruning framework for coding agents. SWE-Pruner performs task-aware adaptive pruning for long contexts. It achieves 23-54% token reduction on agent tasks like SWE-Bench Verified and up to 14.84x compression on single-turn tasks like LongCodeQA.
arXiv Detail & Related papers (2026-01-23T13:51:59Z)
- Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis [2.085792950847639]
Large Language Models (LLMs) exhibit performance degradation when processing contexts that approach certain critical thresholds. This intelligence degradation, defined as over a 30% drop in task performance, severely limits long-context applications. This work provides the first systematic characterization of intelligence degradation in open-source Qwen models.
arXiv Detail & Related papers (2026-01-07T07:56:31Z)
- Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs [39.99645732873852]
We show that inference-time strategies yield rapidly diminishing returns and fail at long context. We propose a simple method that overcomes the limitations of static self-attention. Our method leads to large 12.6 and 14.1 percentage point improvements for Qwen3-4B on average across subsets of the LongBench-v2 and ZeroScrolls benchmarks.
arXiv Detail & Related papers (2025-12-15T21:01:37Z)
- Short-Context Dominance: How Much Local Context Natural Language Actually Needs? [48.429870236229696]
We measure the minimum context length (MCL) needed to reproduce accurate full-context predictions. For sequences with 1-7k tokens from long-context documents, we consistently find that 75-80% require only the last 96 tokens at most. We introduce a practical proxy to MCL, called Distributionally Aware MCL (DaMCL), that does not require knowledge of the actual next token.
arXiv Detail & Related papers (2025-12-08T22:25:00Z)
- LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language model (LLM) agents in realistic, long-context software engineering. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z)
- Overflow Prevention Enhances Long-Context Recurrent LLMs [81.71585057993074]
A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, they underutilize their long contexts.
arXiv Detail & Related papers (2025-05-12T17:45:05Z)
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. Perplexity (PPL) has proven unreliable for assessing long-context capabilities. We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
- HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly [34.205934899868346]
We introduce HELMET, a comprehensive benchmark encompassing seven diverse, application-centric categories. We find that synthetic tasks like NIAH do not reliably predict downstream performance. While most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when tasks require full-context reasoning.
arXiv Detail & Related papers (2024-10-03T17:20:11Z)
- How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We find that code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short-context data. Our final model, ProLong-8B, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
- $\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens [64.08660301017302]
There is currently a lack of a standardized benchmark to evaluate this long-context capability. $\infty$Bench is the first benchmark featuring an average data length surpassing 100K tokens. The results indicate that existing long-context LLMs still require significant advancements to effectively process 100K+ contexts.
arXiv Detail & Related papers (2024-02-21T11:30:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.