Think, Prune, Train, Improve: Scaling Reasoning without Scaling Models
- URL: http://arxiv.org/abs/2504.18116v1
- Date: Fri, 25 Apr 2025 06:48:55 GMT
- Title: Think, Prune, Train, Improve: Scaling Reasoning without Scaling Models
- Authors: Caia Costello, Simon Guo, Anna Goldie, Azalia Mirhoseini,
- Abstract summary: Large language models (LLMs) have demonstrated strong capabilities in programming and mathematical reasoning tasks, but are constrained by limited high-quality training data.<n>We introduce a scalable framework that iteratively fine-tunes models on their own reasoning traces, using ground-truth pruning to ensure high-quality training data.<n>This approach yields improved performance: on GSM8K, Gemma2-2B achieves a Pass@1 of 57.6% (from 41.9%), Gemma2-9B reaches 82%, matching LLaMA-3.1-70B, and LLaMA-3.1-70B attains 91%, even surpassing GPT-4o
- Score: 1.96238419451815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated strong capabilities in programming and mathematical reasoning tasks, but are constrained by limited high-quality training data. Synthetic data can be leveraged to enhance fine-tuning outcomes, but several factors influence this process, including model size, synthetic data volume, pruning strategy, and number of fine-tuning rounds. We explore these axes and investigate which conditions enable model self-improvement. We introduce the Think, Prune, Train process, a scalable framework that iteratively fine-tunes models on their own reasoning traces, using ground-truth pruning to ensure high-quality training data. This approach yields improved performance: on GSM8K, Gemma2-2B achieves a Pass@1 of 57.6% (from 41.9%), Gemma2-9B reaches 82%, matching LLaMA-3.1-70B, and LLaMA-3.1-70B attains 91%, even surpassing GPT-4o, demonstrating the effectiveness of self-generated reasoning and systematic data selection for improving LLM capabilities.
Related papers
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.
Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - LIMR: Less is More for RL Scaling [25.477841726836836]
We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples.<n>Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523 samples dataset.<n>For reproducible research and future innovation, we are open-sourcing LIMR, including implementation of LIM, training and evaluation code, curated datasets, and trained models.
arXiv Detail & Related papers (2025-02-17T15:13:29Z) - The Best Instruction-Tuning Data are Those That Fit [17.401088816596054]
Supervised fine-tuning (SFT) data are crucial for eliciting strong capabilities from pretrained large language models (LLMs)<n>We propose **GRAPE**, a novel SFT framework that accounts for the unique characteristics of the target model.<n>For each instruction, it gathers responses from various LLMs and selects the one with the highest probability measured by the target model.
arXiv Detail & Related papers (2025-02-06T16:31:21Z) - What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Self-Data Distillation for Recovering Quality in Pruned Large Language Models [1.5665059604715017]
One-shot pruning results in significant quality degradation, particularly in tasks requiring multi-step reasoning.
To recover lost quality, supervised fine-tuning (SFT) is commonly applied, but it can lead to catastrophic forgetting.
In this work, we utilize self-data distilled fine-tuning to address these challenges.
arXiv Detail & Related papers (2024-10-13T19:53:40Z) - Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.<n>Existing direct preference learning algorithms are originally designed for the single-turn chat task.<n>We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z) - Crafting Efficient Fine-Tuning Strategies for Large Language Models [2.633490094119608]
Fine-tuning large language models (LLMs) with as few as 200 samples can improve model accuracy from 70% to 88% in a product attribute extraction task.
A bayesian hyperparameter optimization method, which evaluates models at 20% of total training time, correlates strongly with final model performance.
This approach led to a 2% improvement in accuracy over baseline models when evaluated on an independent test set.
arXiv Detail & Related papers (2024-07-18T21:36:00Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities.
The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z) - Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks.
We propose a novel pruning approach: firstly, a training set of a certain number of architecture-accuracy pairs is established, and then a non-neural model is trained as an accuracy predictor.
arXiv Detail & Related papers (2023-09-18T06:38:24Z) - Scaling Relationship on Learning Mathematical Reasoning with Large
Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM.
We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K which outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.