Fugu-MT Paper Translation (Abstract): seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs

Paper overview: seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs

arxiv url: http://arxiv.org/abs/2509.16866v1
Date: Sun, 21 Sep 2025 01:32:13 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:16.012592
Title: seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs
Title (reference translation): seqBench: A Tunable Benchmark for Quantifying the Sequential Reasoning Limits of LLMs
Authors: Mohammad Ramezanali, Mo Vazifeh, Paolo Santi
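Abstract summary: We introduce seqBench, a benchmark for probing sequential reasoning limits in Large Language Models (LLMs). We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity.
Reference score (independently computed attention level): 1.0519693622157462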
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise, multi-dimensional control over several key complexity dimensions. seqBench allows systematic variation of (1) the logical depth, defined as the number of sequential actions required to solve the task; (2) the number of backtracking steps along the optimal path, quantifying how often the agent must revisit prior states to satisfy deferred preconditions (e.g., retrieving a key after encountering a locked door); and (3) the noise ratio, defined as the ratio between supporting and distracting facts about the environment. Our evaluations on state-of-the-art LLMs reveal a universal failure pattern: accuracy collapses exponentially beyond a model-specific logical depth. Unlike existing benchmarks, seqBench's fine-grained control facilitates targeted analyses of these reasoning failures, illuminating universal scaling laws and statistical limits, as detailed in this paper alongside its generation methodology and evaluation metrics. We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity, underscoring key limitations in their commonsense reasoning capabilities. Designed for future evolution to keep pace with advancing models, the seqBench datasets are publicly released to spur deeper scientific inquiry into LLM reasoning, aiming to establish a clearer understanding of their true potential and current boundaries for robust real-world application.
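Abstract (reference translation): This paper introduces seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs). seqBench allows systematic variation of (1) the logical depth, defined as the number of sequential actions required to solve the task; (2) the number of backtracking steps along the optimal path, which quantifies how often the agent must revisit prior states to satisfy deferred preconditions (e.g., retrieving a key after encountering a locked door); and (3) the noise ratio, defined as the ratio between supporting and distracting facts about the environment. Evaluations on state-of-the-art LLMs show a universal failure pattern: accuracy collapses exponentially beyond a model-specific logical depth. Unlike existing benchmarks, seqBench's fine-grained control facilitates targeted analyses of these reasoning failures, illuminating universal scaling laws and statistical limits, as detailed in the paper. Even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity, underscoring key limitations in their commonsense reasoning capabilities. Designed for future evolution to keep pace with advancing models, the seqBench datasets are publicly released to spur deeper scientific inquiry into LLM reasoning, with the aim of establishing a clearer understanding of the models' true potential and current boundaries for robust real-world application.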
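The three complexity dimensions named in the abstract can be pictured as a small parameter record. The following is a minimal sketch only: the names `TaskParams`, `noise_ratio`, and `sweep` are hypothetical and do not come from the seqBench release, and the noise ratio is computed as supporting facts divided by distracting facts, following the wording of the abstract.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TaskParams:
    """Hypothetical record of the three knobs described in the abstract."""
    logical_depth: int       # sequential actions required to solve the task
    backtracking_steps: int  # revisits of prior states along the optimal path
    supporting_facts: int    # facts that support the solution
    distracting_facts: int   # facts that only distract

    def __post_init__(self):
        # Reject values that make no sense for any of the dimensions.
        if self.logical_depth < 1:
            raise ValueError("logical_depth must be at least 1")
        if min(self.backtracking_steps, self.supporting_facts,
               self.distracting_facts) < 0:
            raise ValueError("counts must be non-negative")

    @property
    def noise_ratio(self) -> float:
        """Supporting-to-distracting ratio; infinite when there are no distractors."""
        if self.distracting_facts == 0:
            return math.inf
        return self.supporting_facts / self.distracting_facts


def sweep(depths, backtracks, supporting_facts, distracting_facts):
    """Yield one TaskParams per (depth, backtracking) pair at fixed fact counts."""
    for depth, back in product(depths, backtracks):
        yield TaskParams(depth, back, supporting_facts, distracting_facts)


if __name__ == "__main__":
    for params in sweep([5, 10], [0, 1], supporting_facts=6, distracting_facts=3):
        print(params, params.noise_ratio)
```

Varying one field while holding the others fixed mirrors the systematic, one-dimension-at-a-time variation the abstract describes; how the actual benchmark generator encodes these parameters is not stated here.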

Related papers