Fugu-MT Paper Translation (Abstract): LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

Paper overview: LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

arxiv url: http://arxiv.org/abs/2604.14140v1
Date: Wed, 15 Apr 2026 17:58:05 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.669018
Title: LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
Title (reference translation): LongCoT: A Benchmark for Long-Horizon Chain-of-Thought Reasoning
Authors: Sumeet Ramesh Motwani, Daniel Nichols, Charles London, Peggy Li, Fabio Pizzati, Acer Blake, Hasan Hammoud, Tavish McDonald, Akshat Naik, Alesia Ivanova, Vignesh Baskaran, Ivan Laptev, Ruben Glatt, Tal Ben-Nun, Philip Torr, Natasha Jaques, Ameya Prabhu, Brian Bartoldson, Bhavya Kailkhura, Christian Schroeder de Witt
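Abstract summary: LongCoT is a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic. It provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
Reference score (independently computed attention score): 50.27907326876949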
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
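Abstract (reference translation): As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). LongCoT is a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic that isolates and directly measures the long-horizon CoT reasoning capabilities of frontier models. Solving a problem requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect limitations in long-horizon reasoning. At release, the best models achieve <10% accuracy on LongCoT (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%). Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.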
