Fugu-MT Paper Translation (Abstract): Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking

Paper summary: Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking

arxiv url: http://arxiv.org/abs/2510.26122v1
Date: Thu, 30 Oct 2025 04:08:53 GMT
Status: Translation complete
Last updated in system: 2025-10-31 16:05:09.654669
Title: Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking
Title (reference translation): Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock Diverse Thinking in LLMs
Authors: Feng Ju, Zeyu Qin, Rui Min, Zhitao He, Lingpeng Kong, Yi R. Fung
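Abstract summary: Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs). The paper proposes a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories. Using Reasoning Path Divergence (RPD), the authors curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base.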
Reference score (independently computed attention score): 49.8843966537226
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common "one problem, one solution" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .
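The abstract reports its gains in pass@k (for example pass@16). As a minimal sketch of how such a number is commonly computed, the function below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c are correct. Whether the paper uses exactly this estimator is an assumption; the function name `pass_at_k` is illustrative and does not come from the authors' repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: number of solutions sampled for a problem
    c: number of those solutions that are correct
    k: sampling budget being evaluated (k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) counts the k-subsets that contain no correct solution;
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical usage: 16 samples for one problem, 3 of them correct.
estimate = pass_at_k(n=16, c=3, k=4)
```

A benchmark-level score is then the mean of this per-problem estimate over all problems.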
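Abstract (reference translation): Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), but low diversity in model outputs often becomes a bottleneck. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thereby increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, showing that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence.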
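The curation step, choosing a maximally diverse set of solutions per problem, can be illustrated with a small sketch. It assumes a precomputed symmetric matrix of pairwise divergences between candidate solutions and applies a greedy farthest-point heuristic. This is not the authors' algorithm: the abstract does not specify the selection procedure, the matrix values below are made up for illustration, and `select_diverse` is a hypothetical name. Computing the divergences themselves (the RPD metric) is outside the scope of this sketch.

```python
from typing import Sequence


def select_diverse(dist: Sequence[Sequence[float]], m: int) -> list[int]:
    """Greedily pick m indices that are far apart under `dist`.

    dist: symmetric n x n matrix of pairwise divergences (assumed given)
    m:    number of solutions to keep for the problem (1 <= m <= n)
    """
    n = len(dist)
    if not 1 <= m <= n:
        raise ValueError("require 1 <= m <= n")
    if m == 1:
        return [0]
    # Seed with the most divergent pair of solutions.
    pairs = ((a, b) for a in range(n) for b in range(a + 1, n))
    i, j = max(pairs, key=lambda p: dist[p[0]][p[1]])
    chosen = [i, j]
    while len(chosen) < m:
        rest = [c for c in range(n) if c not in chosen]
        # Add the candidate whose closest already-chosen solution is farthest away.
        nxt = max(rest, key=lambda c: min(dist[c][s] for s in chosen))
        chosen.append(nxt)
    return chosen


# Illustrative divergences for four candidate solutions to one problem.
example_dist = [
    [0.0, 0.1, 0.9, 0.5],
    [0.1, 0.0, 0.8, 0.4],
    [0.9, 0.8, 0.0, 0.3],
    [0.5, 0.4, 0.3, 0.0],
]
picked = select_diverse(example_dist, 3)
```

The greedy max-min rule is a heuristic and does not guarantee the globally most diverse subset, but it is cheap and avoids keeping near-duplicate solutions such as candidates 0 and 1 in the example.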


Related papers