DARL: Encouraging Diverse Answers for General Reasoning without Verifiers
- URL: http://arxiv.org/abs/2601.14700v1
- Date: Wed, 21 Jan 2026 06:23:55 GMT
- Title: DARL: Encouraging Diverse Answers for General Reasoning without Verifiers
- Authors: Chongxuan Huang, Lei Lin, Xiaodong Shi, Wenping Hu, Ruiming Tang
- Abstract summary: We propose DARL, a reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers.
- Score: 41.35516261603945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated promising gains in enhancing the reasoning capabilities of large language models. However, its dependence on domain-specific verifiers significantly restricts its applicability to open and general domains. Recent efforts such as RLPR have extended RLVR to general domains, enabling training on broader datasets and achieving improvements over RLVR. However, a notable limitation of these methods is their tendency to overfit to reference answers, which constrains the model's ability to generate diverse outputs. This limitation is particularly pronounced in open-ended tasks such as writing, where multiple plausible answers exist. To address this, we propose DARL, a simple yet effective reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference while preserving alignment with it. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers. Extensive experiments on thirteen benchmarks demonstrate consistent improvements in reasoning performance. Notably, DARL surpasses RLPR, achieving average gains of 1.3 points on six reasoning benchmarks and 9.5 points on seven general benchmarks, highlighting its effectiveness in improving both reasoning accuracy and output diversity.
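The abstract describes DARL's mechanism only at a high level, and the paper's exact reward is not reproduced here. Below is a minimal, hypothetical Python sketch of a reward in that spirit: full credit when a sampled answer stays within a controlled deviation band around the reference, decaying credit when it either drifts too far or collapses onto the reference verbatim. The function name, the similarity score, and the band thresholds are all illustrative assumptions, not details from the paper.

```python
def darl_style_reward(similarity: float, low: float = 0.6, high: float = 0.9) -> float:
    """Hypothetical verifier-free reward (not the paper's actual formula).

    `similarity` is any [0, 1] score between a sampled answer and the
    reference (e.g., normalized token overlap); `low` and `high` bound the
    tolerated deviation band. Answers inside the band earn full reward;
    answers that drift too far, or that copy the reference verbatim,
    earn reduced reward.
    """
    if similarity < low:
        # Too far from the reference: scale the reward down toward zero.
        return 0.5 * similarity / low
    if similarity > high:
        # Near-verbatim copy: discourage overfitting to the reference.
        return 1.0 - 0.5 * (similarity - high) / (1.0 - high)
    # Inside the controlled deviation band: diverse yet still aligned.
    return 1.0


# Example: a mid-band answer earns full reward, the extremes earn less.
print([round(darl_style_reward(s), 2) for s in (0.30, 0.75, 0.98)])
# -> [0.25, 1.0, 0.6]
```

The band shape is one plausible way to reconcile the paper's two stated goals, diversity and alignment with the reference, in a single scalar signal.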
Related papers
- Coupled Variational Reinforcement Learning for Language Model General Reasoning [83.82392089177841]
We propose Coupled Variational Reinforcement Learning (CoVRL) to bridge variational inference and reinforcement learning. CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines.
arXiv Detail & Related papers (2025-12-14T07:03:51Z)
- Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning [79.365697698062]
We propose RGR-GRPO (Reward and Guidance through Rubrics), a framework for multi-domain reasoning. RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance.
arXiv Detail & Related papers (2025-11-15T20:14:51Z)
- Auditable-choice reframing unlocks RL-based verification for open-ended tasks [23.12421867559344]
Verifiable Multiple-Choice Reformulation (VMR) is a novel training strategy that restructures open-ended data into verifiable multiple-choice formats. Across eight open-ended benchmarks, our VMR-based training delivers an average gain of 5.99 points over the baseline. (A toy sketch of this reformulation appears after this list.)
arXiv Detail & Related papers (2025-11-04T10:45:52Z)
- Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning [3.437656066916039]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities. We investigate RLVR on two problems with fully verifiable solutions. We find that RLVR improves evaluation metrics, but often by reinforcing superficial patterns rather than acquiring new reasoning strategies.
arXiv Detail & Related papers (2025-10-30T23:16:02Z)
- RLPR: Extrapolating RLVR to General Domains without Verifiers [103.14103272635893]
We propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. We find that addressing the high variance of this noisy probability reward is crucial to making it work. RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models. (A hedged sketch of such a probability-based reward appears after this list.)
arXiv Detail & Related papers (2025-06-23T02:56:36Z)
- Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective [82.24301452333577]
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains.
arXiv Detail & Related papers (2025-06-17T20:24:00Z)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)
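The VMR entry above describes restructuring open-ended data into a verifiable multiple-choice form. The sketch below is a toy, assumption-laden illustration of such a reformulation step; the function name, option count, and shuffling scheme are hypothetical, not the paper's actual procedure.

```python
import random

def to_multiple_choice(question: str, reference: str,
                       distractors: list[str], seed: int = 0) -> tuple[str, str]:
    """Hypothetical reformulation of an open-ended QA pair into a verifiable
    multiple-choice item: shuffle the reference among distractor answers and
    return the prompt plus the gold option letter, which a rule-based
    verifier can check exactly."""
    rng = random.Random(seed)
    options = distractors + [reference]
    rng.shuffle(options)
    letters = "ABCD"[: len(options)]
    prompt = question + "\n" + "\n".join(
        f"{letter}. {opt}" for letter, opt in zip(letters, options)
    )
    gold = letters[options.index(reference)]
    return prompt, gold


# Example usage on a toy open-ended item.
prompt, gold = to_multiple_choice(
    "Name a renewable energy source.",
    "Solar power",
    ["Coal", "Natural gas", "Diesel"],
)
print(prompt)
print("Gold option:", gold)  # the verifiable target for RL training
```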
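As noted in the RLPR entry above, a sketch of a probability-based, verifier-free reward follows. It is only an illustration under stated assumptions: it scores a rollout by the mean probability the policy assigns to the reference answer's tokens and subtracts a no-reasoning baseline to reduce reward variance. The tensor names, shapes, and exact debiasing rule are hypothetical, not RLPR's published formula.

```python
import torch

def probability_reward(ref_logprobs: torch.Tensor,
                       baseline_logprobs: torch.Tensor) -> torch.Tensor:
    """Hypothetical probability-based reward in the spirit of RLPR.

    ref_logprobs: per-token log-probabilities of the reference answer,
        conditioned on the prompt plus the model's sampled reasoning.
    baseline_logprobs: the same quantity without the reasoning prefix,
        used as a baseline to tame the variance of the noisy reward.
    Both tensors have shape (num_answer_tokens,).
    """
    reward = ref_logprobs.exp().mean()         # mean token probability
    baseline = baseline_logprobs.exp().mean()  # no-reasoning baseline
    return reward - baseline                   # debiased scalar reward


# Example with dummy probabilities for a 4-token reference answer.
r = probability_reward(torch.log(torch.tensor([0.9, 0.8, 0.7, 0.8])),
                       torch.log(torch.tensor([0.5, 0.4, 0.6, 0.5])))
print(round(r.item(), 2))  # -> roughly 0.3
```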