Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
- URL: http://arxiv.org/abs/2510.07774v2
- Date: Thu, 23 Oct 2025 05:10:47 GMT
- Title: Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
- Authors: Youliang Yuan, Qiuyang Mang, Jingbang Chen, Hong Wan, Xiaoyuan Liu, Junjielong Xu, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Pinjia He
- Abstract summary: Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer. In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives - solutions that reach the correct final answer through an unsound reasoning process.
- Score: 40.905635870672945
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer. In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives - solutions that reach the correct final answer through an unsound reasoning process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps - abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest a strong association between these Miracle Steps and memorization, where the model appears to recall the answer directly rather than deriving it. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The generative RRM provides fine-grained, calibrated rewards (0-1) that explicitly penalize logical flaws and encourage rigorous deduction. When integrated into a reinforcement learning pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building models that are not only more accurate but also more reliable.
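To make the contrast between outcome-only and rubric-based rewards concrete, here is a minimal Python sketch. It is not the paper's RRM, which is a generative model that judges the reasoning trajectory itself; this sketch assumes the per-item judgments are already given and only shows how they might be aggregated into a 0-1 reward. The names `RubricItem`, `outcome_reward`, and `rubric_reward`, along with the example rubric and its weights, are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    description: str  # what a sound solution is expected to do
    weight: float     # relative importance (hypothetical values below)


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-only reward: credits the final answer and nothing else."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def rubric_reward(items: list[RubricItem], satisfied: list[bool]) -> float:
    """Weighted fraction of rubric items judged satisfied, in [0, 1]."""
    if len(items) != len(satisfied):
        raise ValueError("one judgment per rubric item is required")
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(item.weight for item, ok in zip(items, satisfied) if ok)
    return earned / total


# Hypothetical problem-specific rubric.
rubric = [
    RubricItem("sets up the governing equation", 1.0),
    RubricItem("derives each intermediate result from earlier steps", 2.0),
    RubricItem("final answer follows from the derivation", 1.0),
]

# A Miracle Step style solution: the answer is correct, but the derivation
# is missing, so only the first rubric item is judged satisfied.
miracle_judgments = [True, False, False]

print(outcome_reward("42", "42"))                # 1.0
print(rubric_reward(rubric, miracle_judgments))  # 0.25
```

Under these assumptions the outcome-only reward gives full credit to the unsound solution, while the rubric-based score drops to 0.25, which is the kind of false positive the abstract describes.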
Related papers
- Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [108.26461635308796]
We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models. We introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training.
arXiv Detail & Related papers (2026-02-04T15:24:52Z) - P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering [51.04492568024515]
We introduce Probabilistic Process Supervision (P2S), a novel framework for fine-grained process rewards. P2S provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps.
arXiv Detail & Related papers (2026-01-28T14:35:20Z) - Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning [59.76691952347156]
Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning capabilities of large language models (LLMs). Most existing RL approaches rely on sparse outcome rewards, which fail to credit correct intermediate steps in partially successful solutions. We propose Verifiable Prefix Policy Optimization (VPPO), which uses PRMs only to localize the first error during RL.
arXiv Detail & Related papers (2026-01-26T21:38:20Z) - InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning [32.274434679047395]
Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). Standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces.
arXiv Detail & Related papers (2026-01-20T18:15:38Z) - Efficient Reasoning via Reward Model [24.105621725286497]
Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs). LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning steps, a phenomenon known as overthinking. We introduce a novel reward formulation named Conciseness Reward Function (CRF) with explicit dependency between the outcome reward and conciseness score.
arXiv Detail & Related papers (2025-11-12T09:51:07Z) - Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning [30.302863491794543]
Process Reward Models (PRMs) aim to guide the step-by-step reasoning of LLMs toward a final answer. Existing PRMs fail to capture inter-step dependencies, or struggle to align process rewards with the final outcome. We propose Conditional Reward Modeling that frames reasoning as a temporal process leading to a correct answer.
arXiv Detail & Related papers (2025-09-30T17:38:45Z) - Promoting Efficient Reasoning with Verifiable Stepwise Reward [7.385337642642193]
Large reasoning models (LRMs) have recently achieved significant progress in complex reasoning tasks, aided by reinforcement learning. LRMs often suffer from overthinking, expending excessive computation on simple problems and reducing efficiency. We propose a novel rule-based verifiable stepwise reward mechanism (VSRM), which assigns rewards based on the performance of intermediate states in the reasoning trajectory.
arXiv Detail & Related papers (2025-08-14T02:43:53Z) - Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling [12.835376812101323]
We introduce the hypothesis that PRMs are also Partial Reward Models. This allows for principled early rejection based on intermediate token-level signals. On math reasoning benchmarks, our method achieves up to 1.4$\times$-9$\times$ reduction in inference FLOPs without degrading final performance.
arXiv Detail & Related papers (2025-08-04T00:58:56Z) - Reward Reasoning Model [104.39256985858428]
Reward Reasoning Models (RRMs) are designed to execute a deliberate reasoning process before generating final rewards. We implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities. Notably, RRMs can adaptively exploit test-time compute to further improve reward accuracy.
arXiv Detail & Related papers (2025-05-20T17:58:03Z) - RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy. We propose Reasoning-Driven Process Reward Modeling (R-PRM). R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - Entropy-Regularized Process Reward Model [43.09203393852343]
Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning. We propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDP). Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models.
arXiv Detail & Related papers (2024-12-15T01:09:23Z)