Fugu-MT Paper Translation (Abstract): Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Paper overview: Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

arxiv url: http://arxiv.org/abs/2606.01682v1
Date: Mon, 01 Jun 2026 04:43:36 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:31.371628
Title: Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning
Title (reference translation): Off-the-Shelf LLMs as Process Scorers: A Training-Free Alternative to PRMs for Mathematical Reasoning
Authors: Atoosa Chegini, Soheil Feizi
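Abstract summary: Chunk-Level Guided Generation is a training-free alternative to PRM guided search that uses an off-the-shelf large language model as a process scorer. The paper shows that scoring variable-length reasoning steps with large-model likelihoods is unreliable because of a systematic length bias. Chunk-Level Guided Generation also produces substantially shorter reasoning traces than PRM guided search.
Reference score (independently computed attention score): 51.88950852117154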
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4--6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.
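Abstract (reference translation): Using a stronger scorer to pick the best response from several small-model samples is a simple inference-time strategy, but it fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but it requires a reward model trained with step-level labels. The authors propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, the small model samples k fixed-length candidate chunks, and the larger model scores the candidates using likelihoods, without generating any text. The selected chunk is committed before the next step, which steers generation before errors can propagate. The framework is instantiated with two selection rules: Likelihood-Guided Selection (LGS), which picks the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. The paper shows that scoring variable-length reasoning steps with large-model likelihoods is unreliable because of a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24, with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4--6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.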
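The two selection rules described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `alpha` weight on the small-model term, the choice to length-normalize both terms in CGS, and the stopping rule in `guided_generate` are all assumptions made for illustration, and the per-token log-probabilities are toy numbers rather than model outputs.

```python
from typing import Callable, List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one chunk."""
    return sum(token_logprobs) / len(token_logprobs)


def lgs_scores(large_lp: List[Sequence[float]]) -> List[float]:
    """LGS: score each chunk by its normalized large-model log-probability."""
    return [mean_logprob(chunk) for chunk in large_lp]


def cgs_scores(large_lp: List[Sequence[float]],
               small_lp: List[Sequence[float]],
               alpha: float = 1.0) -> List[float]:
    """CGS: large-model score minus the small-model score.

    `alpha` is a hypothetical weight; the abstract only says the
    small model's log-probability is subtracted.
    """
    return [mean_logprob(l) - alpha * mean_logprob(s)
            for l, s in zip(large_lp, small_lp)]


def select(scores: Sequence[float]) -> int:
    """Index of the highest-scoring candidate chunk."""
    return max(range(len(scores)), key=lambda i: scores[i])


def guided_generate(prompt: str,
                    sample_chunks: Callable[[str], List[str]],
                    score_chunks: Callable[[str, List[str]], List[float]],
                    max_steps: int) -> str:
    """Commit one selected chunk per step (assumed stopping rule:
    a step budget, or no candidates returned)."""
    text = prompt
    for _ in range(max_steps):
        candidates = sample_chunks(text)          # k chunks from the small model
        if not candidates:
            break
        scores = score_chunks(text, candidates)   # large model scores, no generation
        text += candidates[select(scores)]        # commit before the next step
    return text


# Toy per-token log-probabilities for k = 3 chunks of 4 tokens each.
large_lp = [[-0.5] * 4, [-1.0] * 4, [-0.75] * 4]
small_lp = [[-0.25] * 4, [-2.0] * 4, [-0.5] * 4]

print(select(lgs_scores(large_lp)))            # 0
print(select(cgs_scores(large_lp, small_lp)))  # 1
```

In this toy case LGS picks chunk 0, the one the large model finds most likely, while CGS picks chunk 1, the one the large model rates far higher than the small model does. In practice `sample_chunks` and `score_chunks` would wrap calls to the small and large models.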

Related papers