Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning
- URL: http://arxiv.org/abs/2509.26578v1
- Date: Tue, 30 Sep 2025 17:38:45 GMT
- Title: Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning
- Authors: Zheng Zhang, Ziwei Shan, Kaitao Song, Yexin Li, Kan Ren
- Abstract summary: Process Reward Models (PRMs) aim to guide the step-by-step reasoning of LLMs toward a final answer. Existing PRMs either fail to capture inter-step dependencies or struggle to align process rewards with the final outcome. We propose Conditional Reward Modeling, which frames reasoning as a temporal process leading to a correct answer.
- Score: 30.302863491794543
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning capabilities of large language models (LLMs) by guiding their step-by-step reasoning toward a final answer. However, existing PRMs either treat each reasoning step in isolation, failing to capture inter-step dependencies, or struggle to align process rewards with the final outcome. Consequently, the reward signal fails to respect temporal causality in sequential reasoning and faces ambiguous credit assignment. These limitations make downstream models vulnerable to reward hacking and lead to suboptimal performance. In this work, we propose Conditional Reward Modeling (CRM) that frames LLM reasoning as a temporal process leading to a correct answer. The reward of each reasoning step is not only conditioned on the preceding steps but also explicitly linked to the final outcome of the reasoning trajectory. By enforcing conditional probability rules, our design captures the causal relationships among reasoning steps, with the link to the outcome allowing precise attribution of each intermediate step, thereby resolving credit assignment ambiguity. Further, through this consistent probabilistic modeling, the rewards produced by CRM enable more reliable cross-sample comparison. Experiments across Best-of-N sampling, beam search and reinforcement learning demonstrate that CRM consistently outperforms existing reward models, offering a principled framework for enhancing LLM reasoning. In particular, CRM is more robust to reward hacking and delivers stable downstream improvements without relying on verifiable rewards derived from ground truth.
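One way to picture the idea of conditioning each step's reward on the preceding steps while tying it to the final outcome is the chain rule of probability. The sketch below is illustrative only and is not taken from the paper: it assumes a hypothetical per-step quantity, the probability that a step introduces an error given that all earlier steps were sound, and multiplies the complements to get the probability that the trajectory is still on track. The function names and the input format are assumptions made for this example.

```python
def trajectory_scores(step_fail_probs):
    """Cumulative probability of still being on track after each step.

    step_fail_probs[t] is a hypothetical conditional probability that
    step t introduces an error, given that steps 0..t-1 were sound.
    """
    scores = []
    on_track = 1.0
    for h in step_fail_probs:
        if not 0.0 <= h <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        on_track *= 1.0 - h  # chain rule: P(ok up to t) = P(ok up to t-1) * P(ok at t | ok before)
        scores.append(on_track)
    return scores


def final_score(step_fail_probs):
    """Probability that the whole trajectory is error-free (1.0 if it has no steps)."""
    scores = trajectory_scores(step_fail_probs)
    return scores[-1] if scores else 1.0


def best_of_n(candidates):
    """Pick the candidate name whose trajectory has the highest final score.

    candidates maps a name to its list of per-step conditional error probabilities.
    """
    return max(candidates, key=lambda name: final_score(candidates[name]))


if __name__ == "__main__":
    samples = {"a": [0.1, 0.1], "b": [0.5, 0.0]}
    print(trajectory_scores(samples["a"]))
    print(best_of_n(samples))
```

Because every score in this sketch is a probability of the same event, scores from different samples sit on a common scale, which is the kind of property the abstract points to when it mentions cross-sample comparison in Best-of-N selection.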
Related papers
- Recursive Think-Answer Process for LLMs and VLMs [54.52289112197118]
We propose an efficient Recursive Think-Answer Process (R-TAP). R-TAP enables models to engage in iterative reasoning cycles and generate more accurate answers. We show that R-TAP-enhanced models consistently outperform conventional single-pass methods.
arXiv Detail & Related papers (2026-03-02T17:20:10Z) - Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution [79.98699884805636]
Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. Speakers are rewarded for producing reasoning that is clear to listeners.
arXiv Detail & Related papers (2026-02-18T02:55:55Z) - MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models [43.872922223495586]
Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response. We propose MR-ALIGN, a framework that enhances factuality without relying on external verifiers.
arXiv Detail & Related papers (2025-10-27T15:00:54Z) - From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization [62.07990937720985]
Dimension-level Reward Model (DRM) is a new supervision framework for Large Language Models. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs and enhances their reasoning ability.
arXiv Detail & Related papers (2025-10-13T14:29:15Z) - Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling [12.835376812101323]
We introduce the hypothesis that PRMs are also Partial Reward Models. This allows for principled early rejection based on intermediate token-level signals. On math reasoning benchmarks, our method achieves up to 1.4$\times$-9$\times$ reduction in inference FLOPs without degrading final performance.
arXiv Detail & Related papers (2025-08-04T00:58:56Z) - Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs). Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z) - Reward Reasoning Model [104.39256985858428]
Reward Reasoning Models (RRMs) are designed to execute a deliberate reasoning process before generating final rewards. We implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities. Notably, RRMs can adaptively exploit test-time compute to further improve reward accuracy.
arXiv Detail & Related papers (2025-05-20T17:58:03Z) - ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [64.93140713419561]
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs. Existing fine-tuning-based compression methods either apply post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection. We introduce ConCISE, a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence and Early Stopping to terminate reasoning when confidence is sufficient.
arXiv Detail & Related papers (2025-05-08T01:40:40Z) - RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z) - Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations [30.68740512996253]
Chain-of-thought explanations are widely used to inspect the decision process of large language models. We show that preference optimization can inadvertently reduce the faithfulness of these explanations.
arXiv Detail & Related papers (2025-04-07T17:49:23Z) - Reward Models Identify Consistency, Not Causality [54.987590763737145]
State-of-the-art reward models prioritize structural consistency over causal correctness. Removing the problem statement has minimal impact on reward scores. Altering numerical values or disrupting the reasoning flow significantly affects RM outputs.
arXiv Detail & Related papers (2025-02-20T14:57:14Z) - ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [25.329712997545794]
We propose Retrieval-Augmented Reasoning through Trustworthy Process Rewarding (ReARTeR). ReARTeR enhances RAG systems' reasoning capabilities through post-training and test-time scaling. Experimental results on multi-step reasoning benchmarks demonstrate significant improvements.
arXiv Detail & Related papers (2025-01-14T05:56:26Z) - Question Decomposition Improves the Faithfulness of Model-Generated Reasoning [23.34325378824462]
It is difficult to verify the correctness and safety of the behavior of large language models (LLMs).
One approach is to prompt LLMs to externalize their reasoning, by having them generate step-by-step reasoning as they answer a question.
This approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case.
Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT.
arXiv Detail & Related papers (2023-07-17T00:54:10Z)