Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement
- URL: http://arxiv.org/abs/2512.07611v1
- Date: Mon, 08 Dec 2025 14:58:19 GMT
- Title: Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement
- Authors: Yongsheng Lian
- Abstract summary: This study presents a systematic comparison of three Reinforcement Learning (RL) algorithms for improving complex reasoning in large language models (LLMs). We find that RL-trained models outperform their corresponding base models, although the degree of improvement differs by benchmark. Increasing the group size in GRPO and DAPO leads to more stable training dynamics and higher accuracy, while the impact of the KL-penalty coefficient is non-monotonic.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study presents a systematic comparison of three Reinforcement Learning (RL) algorithms (PPO, GRPO, and DAPO) for improving complex reasoning in large language models (LLMs). Our main contribution is a controlled transfer-learning evaluation: models are first fine-tuned on the specialized Countdown Game and then assessed on a suite of general-purpose reasoning benchmarks. Across all tasks, RL-trained models outperform their corresponding base models, although the degree of improvement differs by benchmark. Our parametric analysis offers practical guidance for RL-based LLM training. Increasing the group size in GRPO and DAPO leads to more stable training dynamics and higher accuracy, while the impact of the KL-penalty coefficient is non-monotonic. Additionally, we find that the Dynamic Sampling (DS) component in DAPO does not improve performance; in fact, the best overall results are achieved with DAPO when DS is disabled.
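The parametric findings above concern two knobs: the group size used for advantage estimation in GRPO/DAPO and the KL-penalty coefficient. The following is a minimal NumPy sketch, not the authors' implementation, of how a group-normalized advantage and a KL-penalized clipped surrogate typically fit together; the function names, shapes, and default values (group size, `kl_coef`, `clip_eps`) are illustrative assumptions.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    # GRPO/DAPO-style advantages: each response's reward is normalized against
    # the mean/std of the responses sampled for the same prompt.
    # rewards: array of shape (num_prompts, group_size).
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def kl_penalized_surrogate(advantages, ratios, kl_to_ref, kl_coef=0.01, clip_eps=0.2):
    # Clipped policy-gradient surrogate minus a KL penalty toward the reference
    # policy. `ratios` are per-response importance ratios pi_theta / pi_old and
    # `kl_to_ref` are per-response KL estimates; kl_coef is the coefficient whose
    # effect the paper reports as non-monotonic.
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean() - kl_coef * kl_to_ref.mean()

# Toy example: 2 prompts, group size 4 (larger groups are reported to give
# more stable training and higher accuracy).
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0, 0.0]])
adv = group_normalized_advantages(rewards)
ratios = np.ones_like(adv)   # policy unchanged since the last update
kl = np.zeros_like(adv)      # policy identical to the reference
print(kl_penalized_surrogate(adv, ratios, kl))
```

Under this sketch, increasing the group size simply gives each prompt's advantage estimate more samples, which is consistent with the reported gain in stability; DAPO's Dynamic Sampling component is not modeled here.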
Related papers
- Data Distribution as a Lever for Guiding Optimizers Toward Superior Generalization in LLMs [60.68927774057402]
We show, for the first time, that a lower simplicity bias (SB) induces better generalization. Motivated by this insight, we demonstrate that reshaping the training data distribution, by upsampling or augmenting examples learned later in training, similarly reduces SB and leads to improved generalization. Our strategy improves the performance of multiple language models, including Phi2-2.7B, Llama3.2-1B, Gemma3-1B-PT, and Qwen3-0.6B-Base, achieving relative accuracy gains of up to 18% when fine-tuned with AdamW and Muon.
arXiv Detail & Related papers (2026-01-31T07:40:36Z) - A Comedy of Estimators: On KL Regularization in RL Training of LLMs [81.7906270099878]
Reinforcement learning (RL) can substantially improve the reasoning performance of large language models (LLMs). The RL objective for LLM training involves a regularization term, the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for the stated objectives, creating a discrepancy between the objective and its implementation. We study the gradients of several estimator configurations, revealing how design choices shape gradient bias. (A sketch of the commonly used KL estimators appears after this list.)
arXiv Detail & Related papers (2025-12-26T04:20:58Z) - A First-Order Logic-Based Alternative to Reward Models in RLHF [0.0]
Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models with human values and preferences. Existing approaches rely heavily on reward models to guide language models toward human-aligned behaviors. We propose a logic-similarity-based reward mechanism as an alternative to conventional reward modeling.
arXiv Detail & Related papers (2025-12-16T05:15:17Z) - Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models [50.84995206660551]
We introduce Conditional advANtage estimatiON (CANON) to amplify the impact of a target metric without presuming its direction. CANON based on entropy consistently outperforms prior methods on both math reasoning and high-complexity logic tasks.
arXiv Detail & Related papers (2025-09-28T16:33:07Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems (those yielding no RL signals and mixed-quality reasoning traces) can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning [72.53466291156604]
We present KDRL, a unified post-training framework that jointly optimizes a reasoning model through teacher supervision (KD) and self-exploration (RL). We first formulate a unified objective that integrates GRPO and KD, and systematically explore how different KL approximations, KL coefficients, and reward-guided KD strategies affect the overall post-training dynamics and performance.
arXiv Detail & Related papers (2025-06-02T19:46:41Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation [29.579349371114702]
Direct Preference Optimization (DPO) is a cost-effective alternative to reinforcement learning (RL) for large language models (LLMs). We show that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance. With simple verifiable rewards, our model achieves RL-level performance with significantly lower computational overhead.
arXiv Detail & Related papers (2025-03-17T06:28:25Z) - Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization [22.67700436936984]
We introduce Direct Advantage Policy Optimization (DAPO), a novel step-level offline reinforcement learning algorithm. DAPO employs a critic function to predict the reasoning accuracy at each step, thereby generating dense signals to refine the generation strategy. Our results show that DAPO effectively enhances the mathematical and code capabilities of both SFT models and RL models.
arXiv Detail & Related papers (2024-12-24T08:39:35Z)
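As noted in the "A Comedy of Estimators" entry above, KL regularization in RL training of LLMs is typically implemented through Monte Carlo estimators of the reverse KL between the trained policy and the reference policy. The sketch below shows the commonly used k1/k2/k3 approximations computed from samples drawn from the trained policy; it is a generic illustration, not the specific estimator configurations analyzed in that paper.

```python
import numpy as np

def kl_estimators(logp_policy, logp_ref):
    # Monte Carlo estimates of KL(pi_theta || pi_ref) from tokens sampled
    # under pi_theta, using the standard k1/k2/k3 approximations.
    # logp_policy, logp_ref: per-token log-probabilities of the sampled tokens
    # under the trained policy and the frozen reference policy.
    log_r = logp_ref - logp_policy        # log(pi_ref / pi_theta)
    k1 = -log_r                           # unbiased, high variance
    k2 = 0.5 * log_r ** 2                 # biased, low variance
    k3 = np.expm1(log_r) - log_r          # unbiased, non-negative, low variance
    return k1.mean(), k2.mean(), k3.mean()

# Toy check: identical policy and reference give zero KL under all three.
logp = np.log(np.array([0.2, 0.5, 0.3]))
print(kl_estimators(logp, logp))
```

Which estimate is plugged into the loss, and whether its gradient is propagated, is exactly the kind of design choice that the paper above argues shapes gradient bias.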