Fugu-MT paper translation (abstract): DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning

Paper overview: DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning

arxiv url: http://arxiv.org/abs/2603.08095v1
Date: Mon, 09 Mar 2026 08:36:55 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.708993
Title: DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning
Title (reference translation): DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning
Authors: Chi-Min Chan, Ehsan Hajiramezanali, Xiner Li, Edward De Brouwer, Carl Edwards, Wei Xue, Sirui Han, Yike Guo, Gabriele Scalia
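Abstract summary: This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. Existing Weak-to-Strong Generalization theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. We employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process.
Reference score (independently computed attention metric): 43.0861898113022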
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, proving that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.
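Abstract (reference translation): In scientific reasoning tasks, the correctness of the reasoning process is as important as the final outcome. Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), but their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. Existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, and that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.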
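The abstract describes the stratification step only at a high level, so the following is a minimal illustrative sketch rather than the authors' method. It assumes that Self-Consensus is the fraction of weak supervisors agreeing with the majority label for a step, that Neighborhood-Consensus is the fraction of the k nearest neighbors (by cosine similarity) sharing that majority label, and that the two are combined with fixed thresholds into three regimes. The function names, thresholds, regime names, and the 0/1 loss mask are all hypothetical; the curriculum and balanced sampling mentioned in the abstract are not modeled.

```python
import math
from collections import Counter


def self_consensus(votes):
    """Return (majority_label, agreement fraction) for one step's weak labels."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def neighborhood_consensus(idx, embeddings, labels, k):
    """Fraction of the k most similar other steps that share step idx's label."""
    sims = [(cosine(embeddings[idx], e), j)
            for j, e in enumerate(embeddings) if j != idx]
    sims.sort(reverse=True)
    nearest = [j for _, j in sims[:k]]
    return sum(labels[j] == labels[idx] for j in nearest) / len(nearest)


def stratify(votes_per_step, embeddings, sc_thresh=0.8, nc_thresh=0.5, k=1):
    """Assign each step a hypothetical reliability regime from SC and NC.

    Thresholds and regime names are assumptions made for illustration.
    """
    majority, sc = zip(*(self_consensus(v) for v in votes_per_step))
    regimes = []
    for i in range(len(votes_per_step)):
        nc = neighborhood_consensus(i, embeddings, majority, k)
        high_sc, high_nc = sc[i] >= sc_thresh, nc >= nc_thresh
        if high_sc and high_nc:
            regimes.append("reliable")
        elif high_sc or high_nc:
            regimes.append("ambiguous")
        else:
            regimes.append("unreliable")
    return regimes


def loss_mask(regimes):
    """Assumed masking rule: drop only the 'unreliable' labels from the loss."""
    return [0.0 if r == "unreliable" else 1.0 for r in regimes]


if __name__ == "__main__":
    # Toy data: three weak supervisors voting on four reasoning steps.
    votes = [[1, 1, 1], [1, 1, 1], [0, 0, 0], [1, 0, 1]]
    embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    regimes = stratify(votes, embeddings)
    print(regimes)
    print(loss_mask(regimes))
```

In this toy setup, a step is kept at full weight only when the supervisors agree and its nearest neighbor carries the same label; a real system would presumably use learned embeddings and tuned thresholds.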

Related papers