Fugu-MT Paper Translation (Summary): OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

Paper summary: OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

arxiv url: http://arxiv.org/abs/2512.10756v1
Date: Thu, 11 Dec 2025 15:47:38 GMT
Status: Translation complete
Last updated in system: 2025-12-12 16:15:42.45352
Title: OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
Title (reference translation): OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
Authors: Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen
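Abstract summary: This paper proposes the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long chains of thought. OPV outperforms much larger open-source models such as Qwen3-Max-Preview, with an F1 score of 83.1 compared to 76.3.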
Reference score (independently computed attention score): 91.15649744496834
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
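Abstract (reference translation): Large language models (LLMs) have made significant progress in solving complex reasoning tasks through Reinforcement Learning with Verifiable Rewards (RLVR). This progress is inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) cannot inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) struggle to reliably detect errors in complex long CoTs, limited by the scarcity of high-quality annotations caused by the prohibitive cost of human annotation. We therefore propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs, achieving accurate and efficient verification and enabling large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations that progressively improves the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV for the next round through Rejection Fine-Tuning (RFT) and RLVR. Extensive experiments demonstrate OPV's superior performance and broad applicability. It outperforms much larger open-source models such as Qwen3-Max-Preview, with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, in close agreement with expert assessment. When collaborating with policy models, OPV yields performance gains, for example raising the accuracy of DeepSeek-R1-Distill-Qwen-32B on AIME2025 from 55.2% to 73.3%.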
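
Illustrative sketch (not from the paper): the abstract says that in each iteration the most uncertain cases of the current best OPV are sent for annotation, but it does not say how uncertainty is measured. The Python sketch below assumes one plausible measure, disagreement among several verdicts sampled from the verifier for the same case. The function names, the case ids, and the vote-based score are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def vote_uncertainty(verdicts: List[bool]) -> float:
    """Return 0.0 when all sampled verdicts agree and 1.0 for an even split.

    Assumption: uncertainty is approximated by vote disagreement; the
    abstract does not specify the measure actually used.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    majority = Counter(verdicts).most_common(1)[0][1]
    # The majority share lies in [0.5, 1.0] for binary verdicts; rescale to [0, 1].
    return 2.0 * (1.0 - majority / len(verdicts))


def select_for_annotation(sampled: Dict[str, List[bool]], budget: int) -> List[str]:
    """Pick the `budget` case ids whose sampled verdicts disagree the most."""
    ranked = sorted(sampled, key=lambda cid: vote_uncertainty(sampled[cid]), reverse=True)
    return ranked[:budget]


# Hypothetical data: True means "solution judged correct", four samples per case.
sampled_verdicts = {
    "case-a": [True, True, True, True],
    "case-b": [True, False, True, False],
    "case-c": [True, True, True, False],
}
to_annotate = select_for_annotation(sampled_verdicts, budget=2)
```

In this toy input the evenly split case ranks first and the unanimous case is left out, which is the behaviour an uncertainty-driven annotation budget is meant to produce. The subsequent RFT and RLVR training steps are outside the scope of this sketch.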


Related papers