MDPO: Multi-Granularity Direct Preference Optimization for Mathematical Reasoning
- URL: http://arxiv.org/abs/2506.15706v1
- Date: Fri, 30 May 2025 08:42:14 GMT
- Title: MDPO: Multi-Granularity Direct Preference Optimization for Mathematical Reasoning
- Authors: Yunze Lin
- Abstract summary: We propose the Multi-Granularity Direct Preference Optimization (MDPO) method, optimizing the mathematical reasoning of Large Language Models (LLMs) at three granularities. Experiments on the open-source models Qwen2 and Llama3 yield improvements of 1.7% and 0.9% on the GSM8K dataset, and 2.3% and 1.2% on the MATH dataset. We also provide a simple pipeline for constructing MDPO training data that requires no manual annotation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) as it requires ensuring the correctness of each reasoning step. Researchers have been strengthening the mathematical reasoning abilities of LLMs through supervised fine-tuning, but because incorrect outputs cannot be suppressed, hallucinations easily arise. Recently, Direct Preference Optimization (DPO) has been widely adopted to align models with human intent by using preference data to prevent LLMs from generating incorrect outputs. However, it has shown limited benefit in long-chain mathematical reasoning, mainly because DPO struggles to effectively capture the differences between accepted and rejected answers in long-chain preference data. The inconsistency between the DPO training objective and the metric LLMs use at generation time also limits its ability to suppress incorrect outputs. We propose the Multi-Granularity Direct Preference Optimization (MDPO) method, which optimizes the mathematical reasoning of LLMs at three granularities: Solution2Solution, Inference2Inference, and Step2Step. Solution2Solution focuses on the correctness of entire long-chain reasoning; Inference2Inference concentrates on logical reasoning between steps; Step2Step corrects computational errors within steps, enhancing the computational capabilities of LLMs. Additionally, we unify the training objectives of the three granularities to align them with the generation metric. We conducted experiments on the open-source models Qwen2 and Llama3, achieving improvements of 1.7% and 0.9% on the GSM8K dataset, and 2.3% and 1.2% on the MATH dataset, outperforming DPO and other DPO variants. Furthermore, we provide a simple pipeline for constructing MDPO training data that requires no manual annotation.
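For context, the abstract builds on the standard DPO objective, reproduced below from the original DPO formulation (Rafailov et al., 2023), not from the MDPO paper itself. A plausible reading of the abstract is that MDPO keeps this logistic form but instantiates the preferred/rejected pair (y_w, y_l) at three granularities, whole solutions, inference transitions between steps, and individual computation steps, while adjusting the objective to match the model's generation-time scoring.

```latex
% Standard DPO objective (Rafailov et al., 2023): \pi_\theta is the policy being
% trained, \pi_{\mathrm{ref}} a frozen reference model, and (y_w, y_l) a
% preferred/rejected response pair for prompt x.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```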
Related papers
- Effective Reinforcement Learning for Reasoning in Language Models [30.994610715391776]
Reinforcement learning (RL) has emerged as a promising strategy for improving the reasoning capabilities of language models (LMs) in domains such as mathematics and coding. We analyze RL algorithm design decisions for LM reasoning, focusing on relatively small models due to computational constraints. Our findings are: (i) on-policy RL significantly outperforms supervised fine-tuning (SFT), (ii) PPO-based off-policy updates increase accuracy instead of reducing variance, and (iii) removing KL divergence can lead to more concise generations and higher accuracy.
arXiv Detail & Related papers (2025-05-22T18:48:09Z)
- Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining [66.54211199959298]
We propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases. It can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.
arXiv Detail & Related papers (2025-05-20T03:59:05Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights [73.9088920210495]
We propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward. TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks. (A generic sketch of this family of token- and step-weighted preference losses appears after this list.)
arXiv Detail & Related papers (2024-10-06T04:03:00Z) - Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
arXiv Detail & Related papers (2024-06-26T17:43:06Z) - 3D-Properties: Identifying Challenges in DPO and Charting a Path Forward [17.27880657597116]
We revisit DPO, analyzing its theoretical foundations and empirical performance. We identify three key properties, termed 3D properties, that emerge from DPO's learning process. We propose simple regularization techniques that improve training stability and performance.
arXiv Detail & Related papers (2024-06-11T14:59:24Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as search problems and proposes two search ideas to identify optimal reasoning paths (a generic best-first-search sketch appears after this list).
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - Model Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to expedite alignment training with human preferences. We demonstrate that ExPO boosts a DPO model trained with only 20% of the training steps to outperform the fully trained one. We show that ExPO notably improves existing open-source LLMs on the leading AlpacaEval 2.0 and MT-Bench benchmarks.
arXiv Detail & Related papers (2024-04-25T17:39:50Z)
- Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model [3.300814846990438]
Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. As they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that are not aligned with human values. This paper studies two main approaches to LLM alignment: Reinforcement Learning with Human Feedback (RLHF) and contrastive learning-based methods like Direct Preference Optimization (DPO). By analyzing the stability and robustness of RLHF and DPO, we propose MPO, a novel method that mitigates the weaknesses of both approaches.
arXiv Detail & Related papers (2024-03-28T14:15:10Z)
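Several of the entries above (Step-DPO, TIS-DPO, and MDPO's Inference2Inference and Step2Step granularities) share one idea: apply the DPO logistic loss to something finer than a whole answer. The snippet below is a minimal, hedged sketch of that family in PyTorch; the function name, the argument layout, and the use of a binary span mask plus optional per-token weights are my own illustration and are not taken from any of the papers listed here.

```python
# Hedged sketch: a DPO-style loss where chosen/rejected log-probabilities are
# accumulated only over a selected span (e.g. a single reasoning step) and can be
# weighted per token. Illustrates the general idea behind step-level and
# token-weighted preference objectives; not the exact loss of any cited paper.
import torch
import torch.nn.functional as F

def span_weighted_dpo_loss(
    policy_logps_w,   # (T_w,) per-token log-probs of the chosen sequence under the policy
    policy_logps_l,   # (T_l,) per-token log-probs of the rejected sequence under the policy
    ref_logps_w,      # (T_w,) same chosen tokens under the frozen reference model
    ref_logps_l,      # (T_l,) same rejected tokens under the frozen reference model
    mask_w,           # (T_w,) 1.0 inside the preferred span/step, 0.0 elsewhere
    mask_l,           # (T_l,) 1.0 inside the rejected span/step, 0.0 elsewhere
    weights_w=None,   # optional per-token importance weights (TIS-DPO-style)
    weights_l=None,
    beta=0.1,
):
    if weights_w is None:
        weights_w = torch.ones_like(policy_logps_w)
    if weights_l is None:
        weights_l = torch.ones_like(policy_logps_l)
    # Sum the reference-normalized log-ratios over the masked span only.
    ratio_w = ((policy_logps_w - ref_logps_w) * mask_w * weights_w).sum()
    ratio_l = ((policy_logps_l - ref_logps_l) * mask_l * weights_l).sum()
    # Standard Bradley-Terry / DPO logistic loss on the difference of ratios.
    return -F.logsigmoid(beta * (ratio_w - ratio_l))

# Toy usage with random numbers standing in for real model log-probs.
torch.manual_seed(0)
lw = torch.randn(6).abs().neg()   # fake per-token log-probs (negative values)
ll = torch.randn(5).abs().neg()
loss = span_weighted_dpo_loss(lw, ll, lw - 0.1, ll - 0.1,
                              mask_w=torch.ones(6), mask_l=torch.ones(5))
print(float(loss))
```

With the masks set to all ones and unit weights, this reduces to the sequence-level DPO loss computed on summed log-probabilities.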
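MindStar is described above only at the level of "reasoning as search", so the following is a generic best-first search over partial reasoning paths, not MindStar's actual algorithm; `expand`, `score_step`, and `is_final` are hypothetical stand-ins for a generator LLM, a reward or verifier model, and an answer detector.

```python
# Hedged sketch: generic reward-guided best-first search over reasoning paths.
import heapq
from typing import Callable, List, Tuple

def best_first_reasoning_search(
    question: str,
    expand: Callable[[str, List[str]], List[str]],       # proposes candidate next steps
    score_step: Callable[[str, List[str], str], float],  # scores a candidate step
    is_final: Callable[[str], bool],                      # detects a terminating step
    max_expansions: int = 50,
    branch: int = 3,
) -> List[str]:
    # Frontier entries: (negative cumulative score, path of steps so far).
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_score, path = heapq.heappop(frontier)
        if path and is_final(path[-1]):
            return path  # highest-scoring complete reasoning path found so far
        for step in expand(question, path)[:branch]:
            s = score_step(question, path, step)
            heapq.heappush(frontier, (neg_score - s, path + [step]))
    return []

# Toy usage with trivial stand-ins; a real system would call an LLM and a reward model.
path = best_first_reasoning_search(
    "2 + 2 * 3 = ?",
    expand=lambda q, p: ["2 * 3 = 6"] if not p
        else (["2 + 6 = 8, so the answer is 8"] if len(p) == 1 else []),
    score_step=lambda q, p, s: 1.0,
    is_final=lambda s: "answer is" in s,
)
print(path)  # ['2 * 3 = 6', '2 + 6 = 8, so the answer is 8']
```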