Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models
- URL: http://arxiv.org/abs/2507.18014v1
- Date: Thu, 24 Jul 2025 01:09:25 GMT
- Title: Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models
- Authors: Datta Nimmaturi, Vaishnavi Bhargava, Rajat Ghosh, Johnu George, Debojyoti Dutta
- Abstract summary: We propose a predictive framework that models training dynamics and helps optimize resource usage. We derive an empirical scaling law based on model size, initial performance, and training progress. We find that training beyond a certain number of epochs offers little gain, suggesting that earlier stopping can significantly reduce compute without sacrificing performance.
- Score: 0.41942958779358663
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large language models (LLMs) for reasoning tasks using reinforcement learning methods like Group Relative Policy Optimization (GRPO) is computationally expensive. To address this, we propose a predictive framework that models training dynamics and helps optimize resource usage. Through experiments on Llama and Qwen models (3B-8B), we derive an empirical scaling law based on model size, initial performance, and training progress. This law predicts reward trajectories and identifies three consistent training phases: slow start, rapid improvement, and plateau. We find that training beyond a certain number of epochs offers little gain, suggesting that earlier stopping can significantly reduce compute without sacrificing performance. Our approach generalizes across model types, providing a practical guide for efficient GRPO-based fine-tuning.
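As a rough illustration of how such a predictive law can drive early stopping, the sketch below fits a logistic curve to logged GRPO rewards and flags the point where the predicted marginal gain flattens. The logistic form, the parameter names, and the plateau threshold are assumptions chosen to mirror the three phases described above, not the paper's published formula.

```python
import numpy as np
from scipy.optimize import curve_fit

def reward_curve(t, r0, r_max, k, t_mid):
    """Hypothetical logistic reward trajectory.

    A logistic curve reproduces the three phases the abstract reports
    (slow start, rapid improvement, plateau); the exact functional form
    and parameterization are assumptions, not the paper's fitted law.
    """
    return r0 + (r_max - r0) / (1.0 + np.exp(-k * (t - t_mid)))

def fit_and_suggest_stop(steps, rewards, plateau_frac=1e-3):
    """Fit the curve to logged rewards and suggest an early-stop step.

    The stop point is the earliest step past the inflection where the
    predicted marginal gain falls below `plateau_frac` of the total
    predicted improvement (an arbitrary threshold for illustration).
    """
    steps = np.asarray(steps, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    p0 = [rewards[0], rewards.max(), 1.0, np.median(steps)]
    params, _ = curve_fit(reward_curve, steps, rewards, p0=p0, maxfev=10_000)
    r0, r_max, k, t_mid = params
    # Evaluate the fitted curve beyond the observed horizon and find
    # where its slope drops into the plateau regime (past the inflection,
    # so the slow-start phase is not mistaken for the plateau).
    grid = np.linspace(steps[0], steps[-1] * 2, 1000)
    slope = np.gradient(reward_curve(grid, *params), grid)
    flat = grid[(grid > t_mid) & (slope < plateau_frac * (r_max - r0))]
    return params, (flat[0] if flat.size else None)
```

Fitting on rewards logged early in a run and stopping once the predicted curve enters its plateau is the kind of compute saving the abstract describes; in practice the threshold and the extrapolation horizon would need tuning per model family.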
Related papers
- It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs [15.263422862969803]
We introduce BackSlash, a training-time compression algorithm for large language models. Our contributions are threefold: a unified, end-to-end framework for LLM optimization based on the GG model; DeepShape, a post-training regularization method that reshapes weight distributions to match a GG profile; and RF8, a compact and hardware-efficient 8-bit floating-point format designed for GG-prior BackSlash training.
arXiv Detail & Related papers (2025-05-31T09:49:17Z)
- The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
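A plausible rendering of that modified law: the standard Chinchilla form with the dense parameter count N replaced by the average parameter count over the pre-training run (symbols follow the usual Chinchilla notation; the constants are not taken from this paper).

```latex
% Chinchilla-style loss fit with N replaced by the average parameter
% count \bar{N} over pre-training; sparse schedules lower \bar{N}.
% Illustrative form only: A, B, E, \alpha, \beta are the usual
% Chinchilla symbols, not values fitted in the cited paper.
L(\bar{N}, D) = E + \frac{A}{\bar{N}^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
\bar{N} = \frac{1}{T} \int_{0}^{T} N(t)\, \mathrm{d}t
```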
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
- Task-Oriented Pre-Training for Drivable Area Detection [5.57325257338134]
We propose a task-oriented pre-training method that begins with generating redundant segmentation proposals.
We then introduce a Specific Category Enhancement Fine-tuning (SCEF) strategy for fine-tuning the Contrastive Language-Image Pre-training (CLIP) model.
This approach generates large amounts of coarse training data for pre-training models, which are then fine-tuned using manually annotated data.
arXiv Detail & Related papers (2024-09-30T10:25:47Z)
- Optimization Hyper-parameter Laws for Large Language Models [52.49860340549727]
We present Opt-Laws, a framework that captures the relationship between hyperparameters and training outcomes. Our validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss. This approach significantly reduces computational costs while enhancing overall model performance.
arXiv Detail & Related papers (2024-09-07T09:37:19Z)
- Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
- Optimizing Large Model Training through Overlapped Activation Recomputation [24.28543166026873]
We present Lynx, a new recomputation framework that reduces overhead by overlapping recomputation with communication in training pipelines. Our comprehensive evaluation using GPT models with 1.3B-23B parameters shows that Lynx outperforms existing recomputation approaches by up to 1.37x.
arXiv Detail & Related papers (2024-06-13T02:31:36Z)
- More Compute Is What You Need [3.184416958830696]
We propose a new scaling law suggesting that, for transformer-based models, performance depends mostly on the total amount of compute spent.
We predict that (a) for inference efficiency, training should prioritize smaller model sizes and larger training datasets, and (b) assuming the exhaustion of available web datasets, scaling the model size might be the only way to further improve model performance.
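Taken at face value, that collapses the usual two-variable scaling law into a single-variable power law in total training compute. A hedged sketch of the claim, using the standard C ≈ 6ND FLOPs estimate for dense transformers (an approximation from the broader scaling-law literature, not this paper's fitted equation):

```latex
% Loss as a power law in total training compute C alone; the
% C \approx 6ND FLOPs estimate is the standard dense-transformer
% approximation, and \gamma is an illustrative exponent.
L(C) \propto C^{-\gamma},
\qquad
C \approx 6\,N\,D
```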
arXiv Detail & Related papers (2024-04-30T12:05:48Z)
- Preparing Lessons for Progressive Training on Language Models [75.88952808979087]
The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions.
We propose Apollo, which prepares lessons for expanding operations by layer functionality during training of low layers.
Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models.
arXiv Detail & Related papers (2024-01-17T13:04:14Z)
- Navigating Scaling Laws: Compute Optimality in Adaptive Model Training [39.96209967632896]
In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data.
We extend the concept of optimality by allowing for an 'adaptive' model, i.e., a model that can change its shape during training.
arXiv Detail & Related papers (2023-11-06T16:20:28Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
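In log-probability terms, LM up-scaling amounts to adding the small model's fine-tuning "delta" to the large pre-trained model at sampling time. The decomposition below follows the EFT idea as summarized above; the notation is illustrative rather than the paper's own:

```latex
% Emulated fine-tuning / LM up-scaling: sample from the large
% pre-trained model shifted by the small model's fine-tuning delta
% (log-probabilities are unnormalized before the final softmax).
\log \tilde{\pi}(y \mid x) =
  \log \pi_{\mathrm{pre}}^{\mathrm{large}}(y \mid x)
  + \log \pi_{\mathrm{ft}}^{\mathrm{small}}(y \mid x)
  - \log \pi_{\mathrm{pre}}^{\mathrm{small}}(y \mid x)
```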
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)