Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning
- URL: http://arxiv.org/abs/2502.14356v1
- Date: Thu, 20 Feb 2025 08:28:11 GMT
- Title: Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning
- Authors: Huimin Xu, Xin Mao, Feng-Lin Li, Xiaobao Wu, Wang Chen, Wei Zhang, Anh Tuan Luu
- Abstract summary: We propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning.
Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain.
We show that Full-Step-DPO achieves superior performance compared to state-of-the-art baselines.
- Score: 30.096211889103998
- License:
- Abstract: Direct Preference Optimization (DPO) often struggles with long-chain mathematical reasoning. Existing approaches, such as Step-DPO, typically improve this by focusing on the first erroneous step in the reasoning chain. However, they overlook all other steps and rely heavily on humans or GPT-4 to identify erroneous steps. To address these issues, we propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning. Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain. This is achieved by training a self-supervised process reward model, which automatically scores each step, providing rewards while avoiding reliance on external signals. Furthermore, we introduce a novel step-wise DPO loss, which dynamically updates gradients based on these step-wise rewards, endowing language models with stronger reasoning capabilities. Extensive evaluations on both in-domain and out-of-domain mathematical reasoning benchmarks across various base language models demonstrate that Full-Step-DPO achieves superior performance compared to state-of-the-art baselines.
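For intuition, the following is a minimal PyTorch-style sketch of how step-wise rewards from a process reward model (PRM) might modulate a DPO-style objective. The tensor layout, the use of PRM scores as per-step weights, and the function name are illustrative assumptions; the paper's exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def stepwise_dpo_loss(policy_logps_w, policy_logps_l,
                      ref_logps_w, ref_logps_l,
                      step_rewards_w, step_rewards_l,
                      beta=0.1):
    """Illustrative step-wise DPO loss (not the paper's exact formulation).

    Each *_logps tensor holds per-step summed log-probabilities of the
    chosen (w) / rejected (l) response under the policy or reference model,
    shape [num_steps]. step_rewards_* are PRM scores in [0, 1], used here
    as per-step weights -- an assumption about how the rewards enter.
    """
    # Per-step implicit reward margins, as in vanilla DPO but at step level.
    margin_w = policy_logps_w - ref_logps_w   # [S_w]
    margin_l = policy_logps_l - ref_logps_l   # [S_l]

    # Weight each step by its PRM score: well-rated steps in the chosen
    # response and poorly-rated steps in the rejected response dominate.
    weighted_w = (step_rewards_w * margin_w).sum()
    weighted_l = ((1.0 - step_rewards_l) * margin_l).sum()

    # Standard Bradley-Terry style objective over the weighted margins.
    return -F.logsigmoid(beta * (weighted_w - weighted_l))
```

In this sketch, a poorly rated step in the rejected response receives a large weight (1 - reward), so the gradient pushes probability mass away from it harder than from steps the PRM considers fine; the chosen response is weighted symmetrically.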
Related papers
- AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [29.551802573731305]
We propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word (a minimal sketch of this splitting rule appears after this list).
We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks.
arXiv Detail & Related papers (2025-02-19T18:35:55Z) - Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback [94.25162866972077]
Step-KTO is a training framework that combines process-level and outcome-level binary feedback.
Our experiments show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps.
arXiv Detail & Related papers (2025-01-18T15:38:03Z) - Aligning Few-Step Diffusion Models with Dense Reward Difference Learning [81.85515625591884]
Stepwise Diffusion Policy Optimization (SDPO) is an alignment method tailored for few-step diffusion models.
SDPO incorporates dense reward feedback at every intermediate step to ensure consistent alignment across all denoising steps.
SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations.
arXiv Detail & Related papers (2024-11-18T16:57:41Z) - Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning [38.127313175508746]
Step-Controlled DPO creates negative samples of mathematical reasoning rationales that start making errors at a specified step.
By applying these samples in DPO training, SCDPO can better align the model to understand reasoning errors and output accurate reasoning steps.
arXiv Detail & Related papers (2024-06-30T17:59:07Z) - Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
arXiv Detail & Related papers (2024-06-26T17:43:06Z) - Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs (a minimal sketch of this search loop appears after this list).
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similar performance improvements on code generation.
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
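As a rough illustration of the confidence-based step division described in the AdaptiveStep entry above, here is a hedged sketch: a rationale is split wherever the probability of the sampled token falls below a threshold. The threshold value, input layout, and tokenizer interface are assumptions for illustration, not the paper's actual procedure.

```python
import torch

def split_by_confidence(token_ids, logits, tokenizer, threshold=0.7):
    """Split a sampled rationale into steps at low-confidence token positions.

    token_ids: LongTensor [T] of sampled tokens; logits: FloatTensor [T, V]
    of the model outputs used to sample them; tokenizer: any object with a
    decode(list_of_ids) method (e.g. a Hugging Face tokenizer).
    """
    probs = torch.softmax(logits, dim=-1)                       # [T, V]
    tok_conf = probs[torch.arange(len(token_ids)), token_ids]   # [T]
    steps, current = [], []
    for tok, conf in zip(token_ids.tolist(), tok_conf.tolist()):
        current.append(tok)
        if conf < threshold:   # low confidence -> close the current step here
            steps.append(tokenizer.decode(current))
            current = []
    if current:
        steps.append(tokenizer.decode(current))
    return steps
```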
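Similarly, the step-level greedy search from the last entry can be sketched as follows. `generate_candidates` and `prm_score` are hypothetical callables standing in for an LLM sampler and a trained process reward model, and the stopping rule is an assumption; the paper's algorithm may differ in its details.

```python
def greedy_prm_search(question, generate_candidates, prm_score,
                      max_steps=10, k=5):
    """Greedy decoding over reasoning steps guided by a PRM (illustrative).

    generate_candidates(question, steps_so_far, k) -> list[str] of k candidate
    next steps; prm_score(question, steps) -> float score for the partial chain.
    """
    steps = []
    for _ in range(max_steps):
        candidates = generate_candidates(question, steps, k)
        if not candidates:
            break
        # Keep the candidate the PRM scores highest given the prefix so far.
        best = max(candidates, key=lambda c: prm_score(question, steps + [c]))
        steps.append(best)
        if best.strip().startswith("Answer:"):  # assumed answer marker
            break
    return steps
```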