Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics
- URL: http://arxiv.org/abs/2602.10885v1
- Date: Wed, 11 Feb 2026 14:13:46 GMT
- Title: Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics
- Authors: Leheng Sheng, Wenchang Ma, Ruixin Hong, Xiang Wang, An Zhang, Tat-Seng Chua
- Abstract summary: We propose RLCER (Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics), which enhances outcome-centric RLVR. We show that self-proposed and self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards.
- Score: 54.03266761370048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although chain-of-thought (CoT) plays a crucial role in LLM reasoning, directly rewarding it is difficult: training a reward model demands heavy human labeling effort, and static RMs struggle with evolving CoT distributions and reward hacking. These challenges motivate us to seek an autonomous CoT rewarding approach that requires no human annotation and can evolve gradually. Inspired by recent self-evolving training methods, we propose RLCER (Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics), which enhances outcome-centric RLVR by rewarding CoTs with self-proposed and self-evolving rubrics. We show that self-proposed and self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards, enabling RLCER to outperform outcome-centric RLVR. Moreover, when used as in-prompt hints, these self-proposed rubrics further improve inference-time performance.
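The abstract sketches rubric-based CoT rewarding only at a high level. As a rough, hypothetical illustration (not the authors' implementation), the snippet below shows one way self-proposed rubrics could be turned into a scalar CoT reward that augments an outcome-centric RLVR signal; the `llm_generate` stub, the prompts, the yes/no rubric format, and the 0.5 mixing weight are all assumptions made for this sketch.

```python
# Hypothetical sketch of rubric-based CoT rewarding on top of an outcome reward.
# `llm_generate` is a placeholder for any LLM call; prompts, parsing, and the
# mixing weight are illustrative assumptions, not the paper's actual recipe.
from typing import List


def llm_generate(prompt: str) -> str:
    """Placeholder LLM call; back this with a real model or API client."""
    return ""


def propose_rubrics(question: str, n_rubrics: int = 4) -> List[str]:
    """Ask the policy model itself to propose grading rubrics for this question."""
    raw = llm_generate(
        f"List {n_rubrics} short yes/no criteria that a correct reasoning chain "
        f"for the following problem should satisfy:\n{question}"
    )
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]


def rubric_reward(question: str, cot: str, rubrics: List[str]) -> float:
    """Score a chain of thought as the fraction of rubrics it satisfies."""
    if not rubrics:
        return 0.0
    hits = 0
    for rubric in rubrics:
        verdict = llm_generate(
            f"Problem: {question}\nReasoning: {cot}\n"
            f'Does the reasoning satisfy this criterion: "{rubric}"? Answer yes or no.'
        )
        hits += verdict.strip().lower().startswith("yes")
    return hits / len(rubrics)


def total_reward(question: str, cot: str, outcome_correct: bool,
                 rubrics: List[str], cot_weight: float = 0.5) -> float:
    """Combine the verifiable outcome reward with the rubric-based CoT reward."""
    return float(outcome_correct) + cot_weight * rubric_reward(question, cot, rubrics)
```

The paper's rubrics are also self-evolving; one simple (assumed) approximation would be to periodically regenerate rubrics during training, for example on problems where rubric scores and verified outcomes disagree.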
Related papers
- Discovering Process-Outcome Credit in Multi-Step LLM Reasoning [3.584086358722852]
Reinforcement Learning (RL) serves as a potent paradigm for enhancing reasoning capabilities in Large Language Models (LLMs). We propose a novel framework designed to provide continuous reward signals. Our model exhibits superior out-of-distribution robustness, demonstrating promising zero-shot transfer capabilities to unseen and challenging reasoning tasks.
arXiv Detail & Related papers (2026-02-01T05:44:09Z)
- Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z)
- Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning [29.778703252962092]
Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs). We develop a novel test-time reward mechanism that operates without external supervision.
arXiv Detail & Related papers (2025-10-20T07:53:51Z)
- Confidence as a Reward: Transforming LLMs into Reward Models [54.98336080630691]
Confidence-as-a-Reward (CRew) is a training-free method that utilizes token-level confidence in the model's final answers as a proxy for reward. We show that CRew outperforms existing training-free reward approaches on the MATH500 and RewardMATH benchmarks. We propose CRew-DPO, a training strategy that constructs preference data from confidence scores combined with correctness signals.
arXiv Detail & Related papers (2025-10-15T12:51:47Z)
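As a minimal sketch of the confidence-as-reward idea summarized in the entry above (not the CRew authors' code), the snippet below scores an answer by the mean probability a causal LM assigns to its own answer tokens given the prompt; the model name, prompt formatting, and mean-probability aggregation are assumptions.

```python
# Hedged sketch: token-level confidence over the final answer as a proxy reward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def confidence_reward(prompt: str, answer: str) -> float:
    """Mean probability the model assigns to each token of `answer` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # The logit at position t predicts token t + 1, so slice the positions
    # whose predictions correspond to the answer tokens.
    answer_logits = logits[0, prompt_ids.shape[1] - 1 : -1]
    log_probs = torch.log_softmax(answer_logits, dim=-1)
    token_log_probs = log_probs.gather(1, answer_ids[0].unsqueeze(1)).squeeze(1)
    return token_log_probs.exp().mean().item()
```

A response whose answer tokens the model itself finds highly probable receives a higher reward, which is the training-free signal the CRew summary describes.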
- Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models [56.055015597319674]
Reinforcement learning with verifiable rewards (RLVR) is effective at improving the reasoning ability of large language models (LLMs). Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs. We propose Co-rewarding, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views.
arXiv Detail & Related papers (2025-08-01T08:09:14Z)
- Intra-Trajectory Consistency for Reward Modeling [67.84522106537274]
We develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards. We show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results.
arXiv Detail & Related papers (2025-06-10T12:59:14Z)
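To make the intra-trajectory consistency idea in the entry above concrete, here is a small hedged sketch (not the paper's implementation): rewards assigned to adjacent prefixes of a trajectory are pulled toward each other in proportion to the next-token generation probability that links them. The tensor shapes and the squared-difference penalty are assumptions for illustration.

```python
import torch


def intra_trajectory_consistency_loss(step_rewards: torch.Tensor,
                                      next_token_probs: torch.Tensor) -> torch.Tensor:
    """Penalize reward jumps between adjacent prefixes, weighted by how likely
    the transition is under the generator.

    step_rewards: shape [T], one scalar reward per prefix of the trajectory.
    next_token_probs: shape [T - 1], probability of the token extending prefix t to t + 1.
    """
    reward_diffs = (step_rewards[1:] - step_rewards[:-1]) ** 2
    return (next_token_probs * reward_diffs).mean()


# Toy usage: high-probability transitions dominate the penalty, so the reward
# model is encouraged to keep rewards consistent exactly where generation is confident.
rewards = torch.tensor([0.2, 0.8, 0.7, 0.9])
probs = torch.tensor([0.9, 0.3, 0.95])
print(intra_trajectory_consistency_loss(rewards, probs).item())
```

This regularizer would be added to the usual reward-model training objective; how the two terms are weighted is not specified in the summary above.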
- Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood. We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z)
- Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models [50.4652276723694]
Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities. Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%.
arXiv Detail & Related papers (2025-05-22T05:56:11Z)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this issue.
arXiv Detail & Related papers (2025-05-19T17:59:31Z)
- PROGRESSOR: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement [16.768912344111946]
We present PROGRESSOR, a framework that learns a task-agnostic reward function from videos. We show that PROGRESSOR enables robots to learn complex behaviors without any external supervision.
arXiv Detail & Related papers (2024-11-26T04:17:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.