Related papers: JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

URL: http://arxiv.org/abs/2601.08468v1
Date: Tue, 13 Jan 2026 11:47:42 GMT
Title: JudgeRLVR: Judge First, Generate Second for Efficient Reasoning
Authors: Jiangshan Duo, Hanyu Li, Hailin Zhang, Yudong Wang, Sujian Li, Liang Zhao,
Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models.<n>In this paper, we argue that discriminative capability is a prerequisite for efficient generation.<n>We propose JudgeRLVR, a two-stage judge-then-generate paradigm.
Score: 20.448286296459344
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generating RLVR initialized from the judge. Compared to Vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality--efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy gain with -42\% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy improvement, demonstrating enhanced generalization.

Related papers

LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards [51.45138356629732]
We introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward.<n>This auxiliary signal directly incentivizes the model for selecting the correct grounding information.<n>LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks.
arXiv Detail & Related papers (2026-03-02T18:07:53Z)
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning [17.384089089363382]
We identify a root cause that existing methods overlook: the uniform penalization of errors.<n>Current approaches treat all incorrect rollouts within a group identically.<n>We propose the Asymmetric Confidence-aware Error Penalty (ACE)
arXiv Detail & Related papers (2026-02-24T22:46:43Z)
Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning [82.91265691530351]
A$2$D is an Adaptive Ability Decomposing method for enhancing the effectiveness ofReinforcement Learning with verifiable rewards.<n>We first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions.<n>Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance.
arXiv Detail & Related papers (2026-01-31T14:48:23Z)
Self-Improving VLM Judges Without Human Annotations [74.29324865147838]
We present a framework to self-train a VLM judge model without any human preference annotations, using only self-synthesized data.<n>Our method improves a Llama-3.2-11B multimodal judge from 0.38 to 0.51 in overall accuracy on Multimodal RewardBench.<n>The overall strength of these human-annotation-free results suggest the potential for a future self-judge that evolves alongside rapidly improving VLM capabilities.
arXiv Detail & Related papers (2025-12-02T20:52:19Z)
Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards [13.064343544668283]
We propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering"<n>We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500.
arXiv Detail & Related papers (2025-11-21T18:23:04Z)
Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning [3.437656066916039]
Reinforcement with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities.<n>We investigate RLVR on two problems with fully verifiable solutions.<n>We find that RLVR improves evaluation metrics but often by reinforcing superficial Learning metrics rather than acquiring new reasoning strategies.
arXiv Detail & Related papers (2025-10-30T23:16:02Z)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains [13.626335241662977]
Reinforcement Learning with Verifiable Rewards (RLVR)-based post-training of Large Language Models (LLMs) has been shown to improve accuracy on reasoning tasks.<n>We investigate the effect of RL post-training on intermediate tokens which are not directly incentivized.
arXiv Detail & Related papers (2025-10-20T23:58:31Z)
LaSeR: Reinforcement Learning with Last-Token Self-Rewarding [54.72617309922891]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs)<n>Previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency.<n>We propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with a MSE loss.
arXiv Detail & Related papers (2025-10-16T17:55:11Z)
The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models [31.773914661815393]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving Large Language Models' reasoning capabilities.<n>Recent evidence suggests it may paradoxically shrink the reasoning boundary rather than expand it.<n>This paper investigates the shrinkage issue of RLVR by analyzing its learning dynamics.
arXiv Detail & Related papers (2025-10-02T17:17:27Z)
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers [90.50039419576807]
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling.<n>To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $0,1$ during training.<n>This choice carries a cost: it introduces textitfalse negatives (rejecting correct answers, FNs) and textitfalse positives (accepting incorrect ones, FPs)
arXiv Detail & Related papers (2025-10-01T13:56:44Z)
CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models [85.315711639214]
We introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration.<n>For the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture.<n>Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses.
arXiv Detail & Related papers (2025-09-11T17:59:17Z)
Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL [19.659532349434418]
Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models.<n>Yet the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness.<n>We propose Dynamic Reasoning Efficiency Reward (DRER) -- a plug-and-play RL reward framework that reshapes both reward and advantage signals.
arXiv Detail & Related papers (2025-09-07T11:52:18Z)
The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models [54.88805865447848]
We show that instruct models achieve higher efficiency overall, and problem difficulty affects efficiency.<n>We propose COTHINK, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it.<n>On GSM8K, MATH500, and AIME24, COTHINK cuts token usage by 21.1% while keeping accuracy on four thinking models, and remains competitive with strong efficiency baselines.
arXiv Detail & Related papers (2025-05-28T06:24:45Z)
When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable. In order to achieve a better accuracy, we propose two lightweight modules. DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers. QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs. We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.