Reinforcing General Reasoning without Verifiers
- URL: http://arxiv.org/abs/2505.21493v1
- Date: Tue, 27 May 2025 17:56:27 GMT
- Title: Reinforcing General Reasoning without Verifiers
- Authors: Xiangxin Zhou, Zichen Liu, Anya Sims, Haonan Wang, Tianyu Pang, Chongxuan Li, Liang Wang, Min Lin, Chao Du
- Abstract summary: We propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks.
- Score: 47.72684162518086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
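The core idea in the abstract, rewarding a sampled reasoning trace by the probability the policy itself assigns to the reference answer, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the made-up log-probabilities, and the leave-one-out baseline are hypothetical and are not taken from the abstract or from the VeriFree repository.

```python
import math
from typing import List, Sequence


def reference_answer_reward(answer_token_logprobs: Sequence[float]) -> float:
    """Probability the policy assigns to the reference answer.

    `answer_token_logprobs` holds log p(token | question, reasoning, earlier
    answer tokens) for each token of the reference answer. This input is
    hypothetical; in practice it would come from a forward pass of the policy.
    """
    return math.exp(sum(answer_token_logprobs))


def leave_one_out_advantages(rewards: Sequence[float]) -> List[float]:
    """Subtract from each reward the mean reward of the other samples.

    The baseline is an illustrative choice, not something the abstract states.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two samples for a leave-one-out baseline")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Three sampled reasoning traces for one question, each scored by the
# (made-up) log-probabilities of the reference answer tokens that follow it.
sampled = [
    [-0.1, -0.2],
    [-1.5, -2.0],
    [-0.7, -0.3],
]
rewards = [reference_answer_reward(lp) for lp in sampled]
advantages = leave_one_out_advantages(rewards)
```

No verifier model appears anywhere in this sketch: the reward is computed from the policy's own likelihood of the reference answer, which is the practical benefit the abstract emphasizes.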
Related papers
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources, augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). We present a theoretical characterization of LLM reasoning grounded in the information bottleneck (IB) principle. We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
arXiv Detail & Related papers (2025-07-24T13:14:25Z)
- wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models [15.638885149395657]
The intractability of the dLLM likelihood function requires approximating the current, old, and reference policy likelihoods at each policy optimization step. We introduce $\mathtt{wd1}$, a novel policy optimization approach that reformulates the objective as a weighted likelihood. Experiments on widely used reasoning benchmarks demonstrate that $\mathtt{wd1}$, without supervised fine-tuning (SFT) or any supervised data, outperforms existing RL methods for dLLMs.
arXiv Detail & Related papers (2025-07-07T21:27:25Z)
- VerIF: Verification Engineering for Reinforcement Learning in Instruction Following [55.60192044049083]
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs). We propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks.
arXiv Detail & Related papers (2025-06-11T17:10:36Z)
- Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective [6.069069082518759]
We study the Zero-Reward Assumption in reinforcement learning for large language models (LLMs). We show that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model. We propose a new algorithm: Token-Reinforced Policy Optimization (TRePO).
arXiv Detail & Related papers (2025-06-03T07:44:31Z)
- Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning [26.717777746219635]
We take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across commonly used mathematical datasets, resulting in non-negligible false negative rates. We investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL training results imply that they are highly susceptible to hacking, where they misclassify certain patterns
arXiv Detail & Related papers (2025-05-28T10:28:41Z)
- Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
arXiv Detail & Related papers (2025-05-26T07:01:06Z)
- TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning [11.573904453859098]
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs). Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem, false negatives, where verifiers wrongly reject correct model outputs. We propose tinyV, a lightweight LLM-based verifier that augments existing rule-based methods.
arXiv Detail & Related papers (2025-05-20T17:16:44Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- Generative Verifiers: Reward Modeling as Next-Token Prediction [29.543787728397643]
We propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs. We observe improvements of 28% $\rightarrow$ 44.6% on MATH, and 37.9% $\rightarrow$ 53.5% on MMLU abstract algebra.
arXiv Detail & Related papers (2024-08-27T17:57:45Z)
- Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
- Value Functions are Control Barrier Functions: Verification of Safe Policies using Control Theory [46.85103495283037]
We propose a new approach to apply verification methods from control theory to learned value functions.
We formalize original theorems that establish links between value functions and control barrier functions.
Our work marks a significant step towards a formal framework for the general, scalable, and verifiable design of RL-based control systems.
arXiv Detail & Related papers (2023-06-06T21:41:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.