Related papers: Reinforcing General Reasoning without Verifiers

URL: http://arxiv.org/abs/2505.21493v1
Date: Tue, 27 May 2025 17:56:27 GMT
Title: Reinforcing General Reasoning without Verifiers
Authors: Xiangxin Zhou, Zichen Liu, Anya Sims, Haonan Wang, Tianyu Pang, Chongxuan Li, Liang Wang, Min Lin, Chao Du
Abstract summary: We propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks.
Score: 47.72684162518086
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.

Related papers

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). We present a theoretical characterization of LLM reasoning grounded in the information bottleneck (IB) principle. We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
arXiv Detail & Related papers (2025-07-24T13:14:25Z)
wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models [15.638885149395657]
The intractability of the dLLM likelihood function requires approximating the current, old, and reference policy likelihoods at each policy optimization step. We introduce $\mathtt{wd1}$, a novel policy optimization approach that reformulates the objective as a weighted likelihood. Experiments on widely used reasoning benchmarks demonstrate that $\mathtt{wd1}$, without supervised fine-tuning (SFT) or any supervised data, outperforms existing RL methods for dLLMs.
arXiv Detail & Related papers (2025-07-07T21:27:25Z)
VerIF: Verification Engineering for Reinforcement Learning in Instruction Following [55.60192044049083]
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs). We propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks.
arXiv Detail & Related papers (2025-06-11T17:10:36Z)
Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective [6.069069082518759]
We study the Zero-Reward Assumption in reinforcement learning for large language models (LLMs). We show that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model. We propose a new algorithm: Token-Reinforced Policy Optimization (TRePO).
arXiv Detail & Related papers (2025-06-03T07:44:31Z)
Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning [26.717777746219635]
We take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across commonly used mathematical datasets, resulting in non-negligible false negative rates. We investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL training results imply that they are highly susceptible to hacking, where they misclassify certain patterns
arXiv Detail & Related papers (2025-05-28T10:28:41Z)
Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
arXiv Detail & Related papers (2025-05-26T07:01:06Z)
TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning [11.573904453859098]
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs). Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem--false negatives--where verifiers wrongly reject correct model outputs. We propose tinyV, a lightweight LLM-based verifier that augments existing rule-based methods.
arXiv Detail & Related papers (2025-05-20T17:16:44Z)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning. LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors. We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities. In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting LLMs to downstream tasks. We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
Generative Verifiers: Reward Modeling as Next-Token Prediction [29.543787728397643]
We propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs. We observe improvements of 28% $\rightarrow$ 44.6% on MATH, and 37.9% $\rightarrow$ 53.5% on MMLU abstract algebra.
arXiv Detail & Related papers (2024-08-27T17:57:45Z)
Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
Value Functions are Control Barrier Functions: Verification of Safe Policies using Control Theory [46.85103495283037]
We propose a new approach to apply verification methods from control theory to learned value functions. We formalize original theorems that establish links between value functions and control barrier functions. Our work marks a significant step towards a formal framework for the general, scalable, and verifiable design of RL-based control systems.
arXiv Detail & Related papers (2023-06-06T21:41:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.