Fugu-MT Paper Translation (Abstract): Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text

Paper overview: Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text

arxiv url: http://arxiv.org/abs/2605.02173v1
Date: Mon, 04 May 2026 03:05:56 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:50.118456
Title: Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text
Title (reference translation): Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text
Authors: Eric H. C. Chow
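Abstract summary: We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models with advertised 1M-token context windows. Single-needle retrieval at 1M is essentially solved for the strongest models. The sharpest discriminator between current 1M-context flagships is the 512K-to-1M transition.
Reference score (independently computed attention measure): 0.0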
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models with advertised 1M-token context windows on a classical Chinese corpus. Two complementary studies are reported. Test 1 measures single-needle retrieval at 1M tokens of input, with three biographical needles planted at three depths and pairs of real (training-prior-consistent) and altered (training-prior-contradicting) variants to separate genuine in-context retrieval from reliance on memorised training data. Test 2, a follow-up designed to probe whether long-context capability degrades when retrieval requires intermediate reasoning, measures three-hop chain traversal across three context tiers (256K, 512K, and 1M tokens). We find that single-needle retrieval at 1M is essentially solved for the strongest models - Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 each achieve 100% - but that multi-hop performance reveals three distinct decay signatures: a stable regime (Gemini Pro, Claude) maintaining greater than 80% accuracy through 512K with modest degradation at 1M; a late-cliff regime (GPT-5.5, Qwen3.6-plus) collapsing sharply between 512K and 1M; and a smooth-decline regime (DeepSeek V4 Pro) decaying gradually across the entire range. The findings suggest that nominal context-window length is a poor proxy for usable long-context multi-hop capability, and that the sharpest discriminator between current 1M-context flagships is the 512K-to-1M transition.
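Abstract (reference translation): We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models that advertise 1M-token context windows, using a classical Chinese corpus. Two complementary studies are reported. Test 1 measures single-needle retrieval at 1M input tokens: three biographical needles are planted at three depths, in paired real (consistent with training priors) and altered (contradicting training priors) variants, to separate genuine in-context retrieval from reliance on memorised training data. Test 2, a follow-up that probes whether long-context capability degrades when retrieval requires intermediate reasoning, measures three-hop chain traversal across three context tiers (256K, 512K, and 1M tokens). Single-needle retrieval at 1M is essentially solved for the strongest models (Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 each achieve 100%). Multi-hop performance, however, reveals three distinct decay signatures: a stable regime (Gemini Pro, Claude) that maintains greater than 80% accuracy through 512K with modest degradation at 1M; a late-cliff regime (GPT-5.5, Qwen3.6-plus) that collapses sharply between 512K and 1M; and a smooth-decline regime (DeepSeek V4 Pro) that decays gradually across the entire range. The results suggest that nominal context-window length is a poor proxy for usable long-context multi-hop capability, and that the sharpest discriminator between current 1M-context flagships is the 512K-to-1M transition.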
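
The Test 1 design described in the abstract (needles planted at several depths, each in a real and an altered variant) can be illustrated with a minimal sketch. This is not the author's code: the `Needle` class, the helper names, the depth fractions (0.1, 0.5, 0.9), and the substring-based scoring are all assumptions chosen for illustration, and the example needle is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass
class Needle:
    """Hypothetical needle: one fact with a real and an altered answer."""
    template: str        # sentence with an {answer} slot
    real_answer: str     # consistent with what a model may have memorised
    altered_answer: str  # deliberately contradicts the training prior


def plant(haystack, sentence, depth):
    """Insert `sentence` at a fractional depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    idx = round(depth * len(haystack))
    return haystack[:idx] + [sentence] + haystack[idx:]


def build_trials(haystack, needle, depths=(0.1, 0.5, 0.9)):
    """Yield (depth, variant, context, expected) for each depth and variant.

    The depth fractions are assumed; the abstract only says "three depths".
    """
    variants = (("real", needle.real_answer),
                ("altered", needle.altered_answer))
    for depth in depths:
        for variant, answer in variants:
            sentence = needle.template.format(answer=answer)
            yield depth, variant, plant(haystack, sentence, depth), answer


def classify(response, needle, variant):
    """Label a response by simple substring matching (an assumed metric).

    Only the altered variant can expose reliance on memorised data: there,
    answering with the real fact means the model ignored the context.
    """
    expected = needle.real_answer if variant == "real" else needle.altered_answer
    if expected in response:
        return "matches_context"
    if variant == "altered" and needle.real_answer in response:
        return "training_prior"
    return "miss"


# Placeholder data, not taken from the paper's corpus.
haystack = [f"passage {i}" for i in range(10)]
needle = Needle("Person X was born in {answer}.", "Luoyang", "Chengdu")
trials = list(build_trials(haystack, needle))
```

On a real-variant trial a correct answer is ambiguous, since it could come from the context or from memorised data, which is why the paired altered variant is the informative one in this sketch.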

Related papers