Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training
- URL: http://arxiv.org/abs/2602.01511v1
- Date: Mon, 02 Feb 2026 00:50:53 GMT
- Title: Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training
- Authors: Ran Xu, Tianci Liu, Zihan Dong, Tony You, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, Haoyu Wang
- Abstract summary: Rubric-ARM is a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. We show that Rubric-ARM achieves state-of-the-art performance among baselines on benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.
- Score: 29.56905427210088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.
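As a rough picture of the alternating schedule described in the abstract (not the authors' implementation), the sketch below alternates between a REINFORCE update of a toy rubric generator against a frozen judge's agreement with preference labels, and a supervised update of the judge with the sampled rubric held fixed; all models, data, and hyperparameters are illustrative stand-ins.

```python
# Hypothetical sketch of the alternating schedule described in the abstract:
# even steps update the rubric generator (judge frozen) by REINFORCE, using
# the frozen judge's agreement with the preference label as reward; odd steps
# update the judge on preference labels with the sampled rubric held fixed.
# Toy linear models stand in for the LLM components.
import torch

torch.manual_seed(0)
K, D = 4, 8                                    # number of rubrics, feature dim
rubric_generator = torch.nn.Linear(D, K)       # scores candidate rubrics
judge = torch.nn.Linear(D + K, 1)              # scores (pair features, rubric)
opt_rubric = torch.optim.Adam(rubric_generator.parameters(), lr=1e-2)
opt_judge = torch.optim.Adam(judge.parameters(), lr=1e-2)

def judge_logit(x, rubric_onehot):
    return judge(torch.cat([x, rubric_onehot], dim=-1)).squeeze(-1)

def sample_preferences(n=64):
    # Synthetic preference data: x encodes features of a comparison pair,
    # y = 1 means the "chosen" response is genuinely better.
    x = torch.randn(n, D)
    y = (x[:, 0] > 0).float()
    return x, y

for step in range(200):
    x, y = sample_preferences()
    dist = torch.distributions.Categorical(logits=rubric_generator(x))
    rubric = dist.sample()                                   # latent action
    onehot = torch.nn.functional.one_hot(rubric, K).float()

    if step % 2 == 0:
        # Phase A: rubric generator update, judge frozen.
        with torch.no_grad():
            pred = (judge_logit(x, onehot) > 0).float()
            reward = (pred == y).float()                     # judgment accuracy
        loss = -(dist.log_prob(rubric) * (reward - reward.mean())).mean()
        opt_rubric.zero_grad()
        loss.backward()
        opt_rubric.step()
    else:
        # Phase B: judge update; the sampled rubric is held fixed
        # (the one-hot rubric carries no gradient).
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            judge_logit(x, onehot), y)
        opt_judge.zero_grad()
        loss.backward()
        opt_judge.step()
```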
Related papers
- Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards [69.74686029941881]
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models.
We propose a unified neural scheduling framework that adaptively selects high-value rollouts throughout training.
Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
arXiv Detail & Related papers (2026-02-09T10:51:58Z)
- Generative Actor Critic [74.04971271003869]
Generative Actor Critic (GAC) is a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns.
Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-12-25T06:31:11Z)
- Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE.
This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
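The identity behind such token-level surrogates can be sketched in a few lines: broadcast one sequence-level reward (minus a baseline) over the per-token log-probabilities, so that the surrogate's gradient matches the sequence-level REINFORCE gradient. Shapes and names below are assumptions for illustration, not taken from the paper.

```python
# Minimal sketch of the standard identity the summary refers to: a single
# sequence-level reward broadcast over per-token log-probs gives the same
# gradient as REINFORCE on the whole sequence. Names/shapes are illustrative.
import torch

def token_level_surrogate(token_logprobs, seq_rewards, mask, baseline=0.0):
    """token_logprobs: [B, T] log pi(y_t | y_<t, x); seq_rewards: [B];
    mask: [B, T] with 1 for real tokens, 0 for padding."""
    advantage = (seq_rewards - baseline).unsqueeze(-1)        # [B, 1]
    # Summing per-token terms equals (R - b) * log pi(sequence), so the
    # surrogate's gradient matches the sequence-level objective.
    per_token = -advantage * token_logprobs * mask
    return per_token.sum() / mask.sum()

# Usage with dummy tensors:
B, T = 2, 5
logprobs = torch.randn(B, T, requires_grad=True)
loss = token_level_surrogate(logprobs, torch.tensor([1.0, -0.5]),
                             torch.ones(B, T))
loss.backward()
```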
arXiv Detail & Related papers (2025-12-01T07:45:39Z)
- OBLR-PO: A Theoretical Framework for Stable Reinforcement Learning [12.77713716713937]
We provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators.
We derive an adaptive learning-rate schedule governed by the signal-to-noise ratio (SNR) of gradients.
We further show that the variance-optimal baseline is a gradient-weighted estimator, offering a new principle for variance reduction.
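As a generic illustration of an SNR-governed step size (not necessarily the schedule derived in the paper), one can treat the squared norm of the mean per-sample gradient as signal, the summed per-sample variance as noise, and scale a base learning rate by a bounded function of their ratio:

```python
# Generic illustration of scaling the learning rate by a gradient
# signal-to-noise ratio; the exact schedule in the paper may differ.
import numpy as np

def snr_scaled_lr(per_sample_grads, base_lr=1e-3, eps=1e-8, max_scale=1.0):
    """per_sample_grads: list of per-sample gradient vectors from one batch."""
    g = np.stack(per_sample_grads)           # [n, d]
    mean = g.mean(axis=0)
    noise = g.var(axis=0).sum() + eps        # per-sample variance, summed over dims
    signal = np.dot(mean, mean)              # squared norm of the mean gradient
    snr = signal / noise
    # Step more aggressively when the mean gradient dominates the noise.
    return base_lr * min(max_scale, snr / (1.0 + snr))

rng = np.random.default_rng(0)
per_sample = [rng.normal(0.1, 1.0, size=16) for _ in range(32)]
print(snr_scaled_lr(per_sample))
```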
arXiv Detail & Related papers (2025-11-28T16:09:28Z)
- CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models.
We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z)
- Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling the reasoning capabilities of Large Language Models.
We propose a tractable computational framework that tracks and leverages curvature information during policy updates.
The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
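A hedged sketch of the masking idea: samples whose instability score exceeds a threshold are excluded from the policy-gradient average. The per-sample score used below is a placeholder; CAPO's actual curvature-based criterion may differ.

```python
# Hedged sketch of masking unstable samples out of a policy-gradient loss.
# The instability score here is a stand-in, not the paper's criterion.
import torch

def masked_pg_loss(logprobs, advantages, instability_scores, threshold):
    """logprobs, advantages, instability_scores: [B] per-sample tensors."""
    mask = (instability_scores <= threshold).float()
    per_sample = -advantages * logprobs
    # Average only over the samples kept by the mask.
    return (per_sample * mask).sum() / mask.sum().clamp(min=1.0)

# Usage with dummy tensors:
lp = torch.randn(16, requires_grad=True)
adv = torch.randn(16)
scores = torch.rand(16)                     # stand-in instability estimates
loss = masked_pg_loss(lp, adv, scores, threshold=0.8)
loss.backward()
```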
arXiv Detail & Related papers (2025-10-01T12:29:32Z)
- Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts [25.205293698698867]
We introduce Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training.
Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance.
Our empirical analysis demonstrates improved computational efficiency measured as tokens/sec across multiple math reasoning benchmarks and model sizes.
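Generically, off-policy completions from a cheaper behavior model can be corrected with importance ratios between target and behavior log-probabilities, with optional truncation to control variance. The sketch below is that generic correction under assumed names, not the Nested-ReFT construction itself.

```python
# Generic off-policy REINFORCE correction in the spirit described above:
# completions come from a cheaper behavior model, and per-sequence importance
# ratios reweight the target-model gradient. Hedged sketch, illustrative names.
import torch

def off_policy_pg_loss(target_logprobs, behavior_logprobs, rewards, clip=None):
    """target_logprobs, behavior_logprobs: [B] sequence log-probs under the
    target and behavior models; rewards: [B] scalar rewards per completion."""
    ratio = torch.exp(target_logprobs - behavior_logprobs.detach())
    if clip is not None:
        # Truncating large ratios controls variance at the cost of some bias.
        ratio = ratio.clamp(max=clip)
    advantage = rewards - rewards.mean()
    # REINFORCE with importance weighting; the ratio carries the gradient path.
    return -(ratio * advantage).mean()

# Usage with dummy numbers:
tgt = torch.randn(8, requires_grad=True)
beh = tgt.detach() + 0.1 * torch.randn(8)
loss = off_policy_pg_loss(tgt, beh, torch.randn(8), clip=5.0)
loss.backward()
```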
arXiv Detail & Related papers (2025-08-13T18:37:46Z)
- Fast and Stable Diffusion Planning through Variational Adaptive Weighting [3.745003761050674]
Diffusion models have recently shown promise in offline RL, but these methods often suffer from high training costs and slow convergence.
We introduce a closed-form approximation method for online estimation of the variational adaptive weighting under the flow-based generative modeling framework.
Experimental results on Maze2D and Kitchen tasks show that our method achieves competitive performance with up to 10 times fewer training steps.
arXiv Detail & Related papers (2025-06-20T02:12:04Z) - Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate unconstrained reward score scaling.
PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
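Reading the summary literally, one possible form is a pairwise Bradley-Terry loss whose score margin is modulated by the pair's length ratio and output-embedding cosine similarity; the sketch below is a hedged reconstruction, not PCRM's exact objective.

```python
# Illustrative sketch of a prior-constrained pairwise reward-model loss:
# the required score margin between chosen and rejected responses is scaled
# by simple priors (length ratio, cosine similarity of output embeddings).
# Hedged reconstruction from the summary, not PCRM's exact loss.
import torch
import torch.nn.functional as F

def pcrm_style_loss(r_chosen, r_rejected, len_chosen, len_rejected,
                    emb_chosen, emb_rejected, alpha=1.0):
    """r_*: [B] scalar rewards; len_*: [B] token counts; emb_*: [B, D]."""
    # Very similar outputs (high cosine similarity, comparable lengths) are not
    # forced far apart; dissimilar pairs get a larger required margin.
    cos = F.cosine_similarity(emb_chosen, emb_rejected, dim=-1)       # [-1, 1]
    len_ratio = torch.minimum(len_chosen, len_rejected) / torch.maximum(
        len_chosen, len_rejected)                                     # (0, 1]
    margin = alpha * (1.0 - 0.5 * (cos + len_ratio))                  # >= 0
    # Bradley-Terry loss with a prior-dependent margin on the score gap.
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Usage with dummy tensors:
B, D = 4, 16
loss = pcrm_style_loss(torch.randn(B), torch.randn(B),
                       torch.randint(10, 200, (B,)).float(),
                       torch.randint(10, 200, (B,)).float(),
                       torch.randn(B, D), torch.randn(B, D))
```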
arXiv Detail & Related papers (2024-04-01T07:49:11Z)