Fugu-MT paper translation (abstract): Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

Paper abstract: Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

arxiv url: http://arxiv.org/abs/2511.02817v1
Date: Tue, 04 Nov 2025 18:42:12 GMT
Status: Translation complete
System update date: 2025-11-05 18:47:06.151849
Title: Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities
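Title (reference translation): Oolong: Evaluating Long-Context Reasoning and Aggregation Capabilities
Authors: Amanda Bertsch, Adithya Pratapa, Teruko Mitamura, Graham Neubig, Matthew R. Gormley
Abstract summary: Oolong is a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level. Even frontier models such as GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieve less than 50% accuracy on both splits of Oolong at 128K.
Reference score (independently computed attention score): 48.54193244645589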
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.
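Abstract (reference translation): As model context lengths keep growing, concerns persist about whether models effectively use their full context length. Several carefully designed long-context evaluations have been released recently, but they tend to rely on retrieval from one or more sections of the context, so nearly all of the context tokens can be disregarded as noise. This covers only one type of task that might be performed with long context. Oolong is a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level and then aggregating those analyses to answer distributional questions. It is split into two task sets: Oolong-synth, a set of naturalistic synthetic tasks in which components of the reasoning problem can easily be ablated, and Oolong-real, a downstream setting that requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong: GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieve less than 50% accuracy on both splits at 128K. The authors release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.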
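
The task shape described in the abstract (analyze each chunk on its own, then aggregate the per-chunk results to answer a distributional question) can be illustrated with a small sketch. This is not the Oolong data or evaluation harness: the keyword-based `classify_chunk` function, the label names, and the example chunks are all hypothetical stand-ins chosen only to show why no chunk can be skipped as noise.

```python
from collections import Counter

# Hypothetical toy labeler: stands in for the per-chunk analysis that a
# model would have to perform in context. Not part of Oolong.
def classify_chunk(chunk: str) -> str:
    positive_words = {"great", "good", "love"}
    negative_words = {"bad", "awful", "hate"}
    words = set(chunk.lower().split())
    if words & positive_words:
        return "positive"
    if words & negative_words:
        return "negative"
    return "neutral"

def label_distribution(chunks):
    # Aggregation step: every chunk contributes to the counts,
    # so the answer depends on the whole context.
    return Counter(classify_chunk(c) for c in chunks)

def most_common_label(chunks):
    # A distributional question: "which label occurs most often?"
    return label_distribution(chunks).most_common(1)[0][0]

# Hypothetical example context made of four short chunks.
chunks = [
    "I love this phone",
    "The battery is awful",
    "It arrived on Tuesday",
    "Great screen and good speakers",
]
dist = label_distribution(chunks)
answer = most_common_label(chunks)
print(dist, answer)
```

In a retrieval-style task, a single relevant chunk would be enough to answer; here, dropping any one chunk can change the counts, which is the property the abstract contrasts with retrieval-based evaluations.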


Related papers