Fugu-MT paper translation (abstract): What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT

Paper summary: What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT

arxiv url: http://arxiv.org/abs/2509.19284v1
Date: Tue, 23 Sep 2025 17:50:54 GMT
Status: translation complete
Last updated in system: 2025-09-24 20:41:27.98596
Title: What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
Title (reference translation): What characterizes effective reasoning? Revisiting the length, review, and structure of CoT
Authors: Yunzhen Feng, Julia Kempe, Cheng Zhang, Parag Jain, Anthony Hartshorn,
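Abstract summary: We find that both naive CoT lengthening and increased review are associated with *lower* accuracy. We introduce a graph view of CoT to extract structure and identify a single statistic. Taken together, these results characterize effective CoTs as those that *fail less* and support *structure-aware* test-time scaling.
Reference score (attention score computed by this site): 23.890290314477273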
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large reasoning models (LRMs) spend substantial test-time compute on long chain-of-thought (CoT) traces, but what *characterizes* an effective CoT remains unclear. While prior work reports gains from lengthening CoTs and increasing review (revisiting earlier steps) via appended *wait* tokens, recent studies suggest that shorter thinking can outperform longer traces. We therefore conduct a systematic evaluation across ten LRMs on math and scientific reasoning. Contrary to the "longer-is-better" narrative, we find that both naive CoT lengthening and increased review are associated with *lower* accuracy. As CoT unfolds step by step, token-level metrics can conflate verbosity with process quality. We introduce a graph view of CoT to extract structure and identify a single statistic, the *Failed-Step Fraction (FSF)* (the fraction of steps in abandoned branches), that consistently outpredicts length and review ratio for correctness across models. To probe causality, we design two interventions. First, we rank candidate CoTs by each metric at test time, where FSF yields the largest pass@1 gains; second, we edit CoTs to remove failed branches, which significantly improves accuracy, indicating that failed branches bias subsequent reasoning. Taken together, these results characterize effective CoTs as those that *fail less* and support *structure-aware* test-time scaling over indiscriminately generating long CoT.
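Abstract (reference translation): Large reasoning models (LRMs) spend considerable test-time compute on long chain-of-thought (CoT) traces, yet it remains unclear what *characterizes* an effective CoT. Prior work reports gains from lengthening CoTs and from increasing review (revisiting earlier steps) through appended *wait* tokens, whereas recent studies suggest that shorter thinking can outperform longer traces. This study therefore carries out a systematic evaluation of ten LRMs on math and scientific reasoning. Contrary to the "longer-is-better" narrative, both naive CoT lengthening and increased review are associated with *lower* accuracy. Because a CoT unfolds step by step, token-level metrics can conflate verbosity with process quality. We introduce a graph view of CoT to extract structure and identify a single statistic, the *Failed-Step Fraction (FSF)*, which is the fraction of steps in abandoned branches and which consistently predicts correctness better than length or review ratio across models. To probe causality, we design two interventions. First, we rank candidate CoTs by each metric at test time, and FSF yields the largest pass@1 gains. Second, we edit CoTs to remove failed branches, which significantly improves accuracy and indicates that failed branches bias subsequent reasoning. Taken together, these results characterize effective CoTs as those that *fail less* and support *structure-aware* test-time scaling.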
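
The Failed-Step Fraction and the two interventions described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the CoT graph is extracted, so the sketch assumes each step has already been labeled as belonging to an abandoned branch or not. The `Step` class and the function names `failed_step_fraction`, `rank_by_fsf`, and `remove_failed_branches` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning step of a CoT (hypothetical representation)."""
    text: str
    abandoned: bool  # assumed label: True if the step lies in an abandoned branch


def failed_step_fraction(steps: List[Step]) -> float:
    """Fraction of steps that lie in abandoned branches (0.0 for an empty CoT)."""
    if not steps:
        return 0.0
    return sum(1 for s in steps if s.abandoned) / len(steps)


def rank_by_fsf(candidates: List[List[Step]]) -> List[List[Step]]:
    """Order candidate CoTs from lowest to highest FSF (first intervention)."""
    return sorted(candidates, key=failed_step_fraction)


def remove_failed_branches(steps: List[Step]) -> List[Step]:
    """Drop steps labeled as abandoned (a simplified view of the second intervention)."""
    return [s for s in steps if not s.abandoned]


# Toy data, invented for illustration only.
cot_a = [Step("try x=2", False), Step("assume n is odd", True),
         Step("contradiction, go back", True), Step("so x=2 works", False)]
cot_b = [Step("try x=2", False), Step("so x=2 works", False)]

best = rank_by_fsf([cot_a, cot_b])[0]
print(failed_step_fraction(cot_a), failed_step_fraction(cot_b))
```

In this toy setup, selecting the first element of `rank_by_fsf` corresponds to picking the candidate with the fewest failed steps relative to its length. How steps are segmented and how abandoned branches are detected are left open here, since the abstract does not specify them.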

Related papers