Fugu-MT Paper Translation (Abstract): The Limits of Long-Context Reasoning in Automated Bug Fixing

Paper overview: The Limits of Long-Context Reasoning in Automated Bug Fixing

arxiv url: http://arxiv.org/abs/2602.16069v1
Date: Tue, 17 Feb 2026 22:51:40 GMT
Status: Translation complete
Last updated in system: 2026-02-19 15:58:30.460919
Title: The Limits of Long-Context Reasoning in Automated Bug Fixing
Title (reference translation): The Limits of Long-Context Reasoning in Automated Bug Fixing
Authors: Ravi Raju, Mengmeng Ji, Shubhangi Upasani, Bo Li, Urmish Thakker
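Abstract summary: Growing context lengths have led to the assumption that large language models (LLMs) can reason directly over an entire context. Recent advances in LLMs have delivered strong performance on software engineering benchmarks. This work systematically evaluates whether current LLMs can reliably perform long-context code debugging and patch generation.
Reference score (independently computed attention score): 4.853967615615349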
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Rapidly increasing context lengths have led to the assumption that large language models (LLMs) can directly reason over entire codebases. Concurrently, recent advances in LLMs have enabled strong performance on software engineering benchmarks, particularly when paired with agentic workflows. In this work, we systematically evaluate whether current LLMs can reliably perform long-context code debugging and patch generation. Using SWE-bench Verified as a controlled experimental setting, we first evaluate state-of-the-art models within an agentic harness (mini-SWE-agent), where performance improves substantially: GPT-5-nano achieves up to a 31% resolve rate on 100 samples, and open-source models such as Deepseek-R1-0528 obtain competitive results. However, token-level analysis shows that successful agentic trajectories typically remain under 20k tokens, and that longer accumulated contexts correlate with lower success rates, indicating that agentic success primarily arises from task decomposition into short-context steps rather than effective long-context reasoning. To directly test long-context capability, we construct a data pipeline where we artificially inflate the context length of the input by placing the relevant files into the context (ensuring perfect retrieval recall); we then study single-shot patch generation under genuinely long contexts (64k-128k tokens). Despite this setup, performance degrades sharply: Qwen3-Coder-30B-A3B achieves only a 7% resolve rate at 64k context, while GPT-5-nano solves none of the tasks. Qualitative analysis reveals systematic failure modes, including hallucinated diffs, incorrect file targets, and malformed patch headers. Overall, our findings highlight a significant gap between nominal context length and usable context capacity in current LLMs, and suggest that existing agentic coding benchmarks do not meaningfully evaluate long-context reasoning.
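Abstract (reference translation): The rapid growth in context length has led to the assumption that large language models (LLMs) can reason directly over an entire codebase. At the same time, recent advances in LLMs have delivered strong performance on software engineering benchmarks, especially when combined with agentic workflows. This work systematically evaluates whether current LLMs can reliably perform long-context code debugging and patch generation. Using SWE-bench Verified as a controlled setting, state-of-the-art models are first evaluated within an agentic harness (mini-SWE-agent): GPT-5-nano reaches a resolve rate of up to 31% on 100 samples, and open-source models such as Deepseek-R1-0528 obtain competitive results. Token-level analysis, however, shows that successful agentic trajectories generally stay under 20k tokens and that longer accumulated contexts correlate with lower success rates, indicating that agentic success stems mainly from decomposing the task into short-context steps rather than from effective long-context reasoning. To test long-context capability directly, the authors build a data pipeline that artificially inflates the input context length while placing the relevant files in the context (guaranteeing perfect retrieval recall), and then study single-shot patch generation at 64k-128k tokens. Performance degrades sharply: Qwen3-Coder-30B-A3B achieves only a 7% resolve rate at 64k context, and GPT-5-nano solves none of the tasks. Qualitative analysis reveals systematic failure modes, including hallucinated diffs, incorrect file targets, and malformed patch headers. These results point to a significant gap between nominal context length and usable context capacity in current LLMs, and suggest that existing agentic coding benchmarks do not meaningfully evaluate long-context reasoning.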
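
Illustration: the abstract describes inflating the input context while always including the relevant files, so retrieval recall is perfect by construction. The sketch below shows one way such a step could look. It is not the authors' pipeline: the padding with unrelated "distractor" files, the 4-characters-per-token estimate, and all function and file names are assumptions made for this example.

```python
from typing import Dict, List


def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption: roughly 4 characters per token)."""
    return max(1, len(text) // 4)


def build_inflated_context(
    relevant: Dict[str, str],
    distractors: Dict[str, str],
    budget_tokens: int,
) -> List[str]:
    """Pick file paths for the prompt.

    Every relevant file is always included (perfect recall); distractor
    files (hypothetical padding) are added in order while they still fit
    within the token budget.
    """
    chosen = list(relevant)
    used = sum(approx_tokens(src) for src in relevant.values())
    for path, src in distractors.items():
        cost = approx_tokens(src)
        if used + cost > budget_tokens:
            continue  # this file would overflow the budget, so skip it
        chosen.append(path)
        used += cost
    return chosen


# Toy inputs: one relevant file (~100 tokens) and three distractors.
relevant = {"pkg/core.py": "x" * 400}
distractors = {
    "pkg/a.py": "y" * 2000,  # ~500 tokens
    "pkg/b.py": "z" * 4000,  # ~1000 tokens, too large for the remaining budget
    "pkg/c.py": "w" * 800,   # ~200 tokens
}
chosen = build_inflated_context(relevant, distractors, budget_tokens=1000)
print(chosen)
```

In a real experiment the budget would be set near the target context size (for example 64k or 128k tokens) and tokens would be counted with the evaluated model's own tokenizer rather than a character heuristic.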

Related papers