CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts
- URL: http://arxiv.org/abs/2510.09278v1
- Date: Fri, 10 Oct 2025 11:21:09 GMT
- Title: CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts
- Authors: Jiuheng Lin, Cong Jiang, Zirui Wu, Jiarui Sun, Yansong Feng
- Abstract summary: Training expert LLMs in domains with scarce data is difficult, often relying on multiple-choice questions (MCQs). Existing solutions to supervise reasoning, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. We propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM.
- Score: 20.606939295163752
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training expert LLMs in domains with scarce data is difficult, often relying on multiple-choice questions (MCQs). However, standard outcome-based reinforcement learning (RL) on MCQs is risky: while it may improve accuracy, we observe it often degrades reasoning quality, such as logical consistency. Existing solutions to supervise reasoning, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. To address this, we propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM. CLARity integrates a consistency-aware reward mechanism with a 2-stage refine-then-monitor training pipeline to enhance reasoning consistency, and a dynamic data reformulation strategy to better exploit limited data. Experiments demonstrate that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines. Human evaluations further confirm holistic improvements in coherence and professionalism. Thus, CLARity offers a generalizable solution that enables smaller models to effectively guide expert models through reasoning consistency. Our code is open-sourced at: https://github.com/Infinite-set/CLARity
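To make the consistency-aware reward idea concrete, the sketch below blends MCQ outcome correctness with a consistency score produced by a small, general-purpose judge LLM. This is a minimal illustration under assumed interfaces: the `Rollout` structure, the `judge_consistency` stub, and the linear weighting are hypothetical and not taken from the CLARity codebase (see the repository linked above for the actual method).

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    reasoning: str      # model's chain-of-thought for the question
    answer: str         # chosen MCQ option, e.g. "B"
    gold_answer: str    # reference option


def judge_consistency(reasoning: str, answer: str) -> float:
    """Placeholder: prompt a small general-purpose judge LLM to rate whether
    the reasoning logically supports the chosen option; return a score in [0, 1]."""
    raise NotImplementedError("call a small judge LLM here")


def consistency_aware_reward(r: Rollout, alpha: float = 0.5) -> float:
    """Blend outcome correctness with reasoning consistency.

    alpha trades off answer correctness against consistency, so a correct
    answer reached through incoherent reasoning is discounted. The actual
    combination rule in CLARity may differ from this linear blend.
    """
    outcome = 1.0 if r.answer == r.gold_answer else 0.0
    consistency = judge_consistency(r.reasoning, r.answer)
    return alpha * outcome + (1.0 - alpha) * consistency
```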
Related papers
- When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents [2.689316553293938]
Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools and the final answer generation for conversational agents.
arXiv Detail & Related papers (2025-12-12T04:44:40Z)
- Efficient Reasoning via Reward Model [24.105621725286497]
Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs). Large reasoning models (LRMs) such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning steps, a phenomenon known as overthinking. We introduce a novel reward formulation named the Conciseness Reward Function (CRF), with an explicit dependency between the outcome reward and a conciseness score.
arXiv Detail & Related papers (2025-11-12T09:51:07Z)
- Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning [29.778703252962092]
Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs). We develop a novel test-time reward mechanism that operates without external supervision.
arXiv Detail & Related papers (2025-10-20T07:53:51Z)
- Confidence as a Reward: Transforming LLMs into Reward Models [54.98336080630691]
Confidence-as-a-Reward (CRew) is a training-free method that uses token-level confidence in the model's final answers as a proxy for reward. We show that CRew outperforms existing training-free reward approaches on the MATH500 and RewardMATH benchmarks. We also propose CRew-DPO, a training strategy that constructs preference data from confidence scores combined with correctness signals. (A minimal illustrative sketch of this confidence-as-reward idea appears after this list.)
arXiv Detail & Related papers (2025-10-15T12:51:47Z)
- Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models [56.055015597319674]
Reinforcement learning with verifiable rewards (RLVR) is effective at improving the reasoning ability of large language models (LLMs). Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs. We propose Co-rewarding, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views.
arXiv Detail & Related papers (2025-08-01T08:09:14Z)
- Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs [35.27561531876348]
This paper systematically investigates the impact of Reinforcement Learning with Verifiable Rewards (RLVR) on Large Language Models (LLMs). We show that RLVR can extend the reasoning boundary for both mathematical and coding tasks. We present a theoretical framework explaining RLVR's incentive mechanism, demonstrating how it can encourage correct reasoning even when rewards are based solely on answer correctness.
arXiv Detail & Related papers (2025-06-17T07:06:56Z)
- Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood. We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z)
- Reinforced Latent Reasoning for LLM-based Recommendation [83.18146814163308]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this issue.
arXiv Detail & Related papers (2025-05-19T17:59:31Z)
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z)
- CREAM: Consistency Regularized Self-Rewarding Language Models [34.325289477993586]
Self-rewarding large language models (LLMs) have successfully applied LLM-as-a-Judge to improve alignment performance without the need for human-annotated preference data. However, there is no guarantee that the rewarding and ranking are accurate, which is critical for obtaining accurate rewards and high-quality preference data. We propose a Consistency Regularized sElf-rewarding lAnguage Model (CREAM) that leverages the consistency of rewards across different iterations to regularize self-rewarding training.
arXiv Detail & Related papers (2024-10-16T16:51:01Z)
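As referenced in the "Confidence as a Reward" entry above, the sketch below illustrates one plausible reading of token-level confidence as a training-free proxy reward: score an answer by the mean probability the model assigns to its own answer tokens. The function name, the prompt/answer token alignment, and the mean aggregation are assumptions made for illustration, not the CRew implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def answer_confidence(model, tokenizer, prompt: str, answer: str) -> float:
    """Mean probability the model assigns to the answer tokens, given the prompt.

    Assumes tokenizing `prompt` alone yields a prefix of tokenizing
    `prompt + answer`; this holds for most tokenizers but is not guaranteed.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                   # [1, seq_len, vocab]
    probs = torch.softmax(logits[:, :-1, :], dim=-1)      # predictions for tokens 1..L-1
    targets = full_ids[:, 1:]
    token_probs = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    answer_start = prompt_ids.shape[1] - 1                # first predicted answer token
    return token_probs[0, answer_start:].mean().item()


# Example usage (model name is illustrative only):
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B")
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")
# reward = answer_confidence(model, tokenizer, "Q: 2+2=? A:", " 4")
```

Such a confidence score could then be used to rank candidate responses or to construct preference pairs, in the spirit of the CRew-DPO strategy mentioned in that entry.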