Fugu-MT Paper Translation (Summary): Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

Paper summary: Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

arxiv url: http://arxiv.org/abs/2510.05381v1
Date: Mon, 06 Oct 2025 21:17:13 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:07.988952
Title: Context Length Alone Hurts LLM Performance Despite Perfect Retrieval
Authors: Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, Hao Peng
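Abstract summary: Large language models (LLMs) often fail to scale their performance on long-context tasks in line with the context lengths they support. This paper presents findings that, even with perfect retrieval, a model may not perform as well on a long input as it does on a short one.
Reference score (attention score computed independently by this site): 29.523005523787244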
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) often fail to scale their performance on long-context tasks in line with the context lengths they support. This gap is commonly attributed to retrieval failures -- the models' inability to identify relevant information in long inputs. Accordingly, recent efforts often focus on evaluating and improving LLMs' retrieval performance: if retrieval is perfect, a model should, in principle, perform just as well on a long input as it does on a short one -- or should it? This paper presents findings that the answer to this question may be negative. Our systematic experiments across 5 open- and closed-source LLMs on math, question answering, and coding tasks reveal that, even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%--85%) as input length increases, even though the input remains well within the models' claimed context lengths. This failure occurs even when the irrelevant tokens are replaced with minimally distracting whitespace and, more surprisingly, when they are all masked and the models are forced to attend only to the relevant tokens. A similar performance drop is observed when all relevant evidence is placed immediately before the question. Our findings reveal a previously unrealized limitation: the sheer length of the input alone can hurt LLM performance, independent of retrieval quality and without any distraction. They motivate our simple, model-agnostic mitigation strategy, which transforms a long-context task into a short-context one by prompting the model to recite the retrieved evidence before attempting to solve the problem. On RULER, we observe a consistent improvement of up to 4% for GPT-4o over an already strong baseline.
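The mitigation described in the abstract (recite the evidence first, then solve from the recited evidence alone) might be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `recite_then_solve`, the prompt wording, and the `generate` callable (a stand-in for any text-in, text-out model call) are all assumptions introduced here.

```python
from typing import Callable


def recite_then_solve(
    long_context: str,
    question: str,
    generate: Callable[[str], str],  # hypothetical model call: prompt -> completion
) -> str:
    """Two-step prompting sketch: recite evidence, then answer from it alone."""
    # Step 1: ask the model to copy out the relevant passages without answering.
    recite_prompt = (
        f"{long_context}\n\n"
        f"Question: {question}\n"
        "Do not answer yet. Quote, word for word, the passages above "
        "that are needed to answer the question."
    )
    evidence = generate(recite_prompt)

    # Step 2: build a new, short prompt that contains only the recited
    # evidence and the question, so the long input is no longer present.
    solve_prompt = (
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\n"
        "Answer the question using only the evidence above."
    )
    return generate(solve_prompt)
```

The second call never sees the original long input, which is one way to read "transforms a long-context task into a short-context one"; the exact prompts and whether the two steps share a conversation are details the abstract does not specify.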
