OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
- URL: http://arxiv.org/abs/2512.10756v1
- Date: Thu, 11 Dec 2025 15:47:38 GMT
- Title: OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
- Authors: Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen
- Abstract summary: We propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long chains of thought. OPV achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3.
- Score: 91.15649744496834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
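The selection step of the active learning loop described in the abstract (annotate the cases the current verifier is most uncertain about) can be illustrated with a minimal sketch. The abstract does not say how uncertainty is measured; the sketch below assumes it is the disagreement among several verdicts sampled from the verifier for the same case. The names `vote_uncertainty` and `select_for_annotation`, the verdict labels, and the case ids are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, List


def vote_uncertainty(verdicts: List[str]) -> float:
    """Return 1 - (share of the majority verdict); 0.0 means unanimous.

    Assumption: uncertainty is approximated by disagreement among
    repeated verifier samples. The paper may use a different measure.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - majority_count / len(verdicts)


def select_for_annotation(sampled: Dict[str, List[str]], budget: int) -> List[str]:
    """Pick the `budget` case ids whose sampled verdicts disagree the most."""
    ranked = sorted(
        sampled,
        key=lambda case_id: vote_uncertainty(sampled[case_id]),
        reverse=True,
    )
    return ranked[:budget]


if __name__ == "__main__":
    # Hypothetical verdicts from 8 verifier samples per case.
    sampled = {
        "case-a": ["correct"] * 8,                  # unanimous, uncertainty 0.0
        "case-b": ["correct"] * 4 + ["error"] * 4,  # evenly split, uncertainty 0.5
        "case-c": ["correct"] * 6 + ["error"] * 2,  # mild disagreement, 0.25
    }
    print(select_for_annotation(sampled, budget=2))  # ['case-b', 'case-c']
```

In a full loop, the selected cases would go to expert annotators and the resulting labels would feed the next round of RFT and RLVR training; those stages are outside the scope of this sketch.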
Related papers
- PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering [71.15346406323827]
We introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification. We find that current verifiers frequently fail to detect derivation flaws. We propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME.
arXiv Detail & Related papers (2026-02-12T04:45:01Z)
- ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution [84.41751286055909]
We develop a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generation. We formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens.
arXiv Detail & Related papers (2026-02-03T07:16:51Z)
- CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
- Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning [52.144281362465996]
We propose EAPO (Evidence-Augmented Policy Optimization) to apply Reinforcement Learning to long-context scenarios. We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling. We then introduce a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism.
arXiv Detail & Related papers (2026-01-15T11:40:57Z)
- Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training [17.530233901658253]
Segmental Advantage Estimation (SAE) mitigates the bias that Generalized Advantage Estimation can incur in Reinforcement Learning with Verifiable Rewards. SAE achieves superior performance, with marked improvements in final scores, stability, and sample efficiency.
arXiv Detail & Related papers (2026-01-12T08:41:47Z)
- Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving [65.02106674311908]
We propose the Outcome-based Process Verifier (OPV). OPV verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3.
arXiv Detail & Related papers (2025-12-11T15:26:28Z)
- A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention [33.03212783462742]
This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-$k$ Attention mechanism. Experiments demonstrate that exact Top-$k$ Decoding achieves performance comparable to, or even surpassing, full attention on downstream tasks. Considering the high computational complexity of exact Top-$k$ Attention, we investigate the impact of approximate Top-$k$ algorithm precision on downstream tasks.
arXiv Detail & Related papers (2025-12-03T06:44:02Z)
- HyPV-LEAD: Proactive Early-Warning of Cryptocurrency Anomalies through Data-Driven Structural-Temporal Modeling [0.0]
Abnormal cryptocurrency transactions pose escalating risks to financial integrity. Existing approaches are predominantly model-centric and post hoc. This paper introduces HyPV-LEAD, a data-driven early-warning framework.
arXiv Detail & Related papers (2025-09-03T12:23:38Z)
- VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [49.0793012627959]
We present VAPO, a novel framework tailored for reasoning models within the value-based paradigm. VAPO attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points.
arXiv Detail & Related papers (2025-04-07T14:21:11Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.