SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs
- URL: http://arxiv.org/abs/2412.08347v1
- Date: Wed, 11 Dec 2024 12:41:36 GMT
- Title: SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs
- Authors: Sultan Alrashed
- Abstract summary: We present SmolTulu, an instruction-tuned language model that adapts AllenAI's Tulu 3 post-training pipeline to enhance Huggingface's SmolLM2-1.7B base model.
Our findings reveal a clear split: reasoning tasks like ARC and GSM8K benefit from higher learning rate to batch size ratios, while pattern recognition tasks such as HellaSwag and IFEval show optimal performance with lower ratios.
- Abstract: We present SmolTulu-1.7b-Instruct, referred to in this report as SmolTulu-DPO-1130, an instruction-tuned language model that adapts AllenAI's Tulu 3 post-training pipeline to enhance Huggingface's SmolLM2-1.7B base model. Through comprehensive empirical analysis using a 135M parameter model, we demonstrate that the relationship between learning rate and batch size significantly impacts model performance in a task-dependent manner. Our findings reveal a clear split: reasoning tasks like ARC and GSM8K benefit from higher learning rate to batch size ratios, while pattern recognition tasks such as HellaSwag and IFEval show optimal performance with lower ratios. These insights informed the development of SmolTulu, which achieves state-of-the-art performance among sub-2B parameter models on instruction following, scoring 67.7% on IFEval ($\Delta$11%), and on mathematical reasoning, scoring 51.6% on GSM8K ($\Delta$3.4%), with an alternate version scoring 57.1% on ARC ($\Delta$5.4%). We release our model, training recipes, and ablation studies to facilitate further research in efficient model alignment, demonstrating that careful adaptation of optimization dynamics can help bridge the capability gap between small and large language models.
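The central ablation is a sweep over learning rate to batch size ratios, with reasoning and pattern-recognition benchmarks compared at each ratio. Below is a minimal sketch of how such a sweep might be set up with Hugging Face's `TrainingArguments`; the specific learning rates, batch sizes, and output paths are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a learning-rate-to-batch-size ratio sweep, assuming a
# standard Hugging Face Trainer fine-tuning setup. The values below are
# illustrative only; they are not the configurations reported in the paper.
from transformers import TrainingArguments

# Candidate (learning_rate, batch_size) pairs spanning a range of ratios.
configs = [
    {"learning_rate": 9e-5, "batch_size": 8},    # higher LR/BS ratio
    {"learning_rate": 5e-5, "batch_size": 32},   # mid ratio
    {"learning_rate": 2e-5, "batch_size": 128},  # lower LR/BS ratio
]

for cfg in configs:
    ratio = cfg["learning_rate"] / cfg["batch_size"]
    args = TrainingArguments(
        output_dir=f"./ablation_lr{cfg['learning_rate']}_bs{cfg['batch_size']}",
        learning_rate=cfg["learning_rate"],
        per_device_train_batch_size=cfg["batch_size"],
        num_train_epochs=1,
    )
    # Each configuration would then be passed to a Trainer for SFT/DPO and
    # evaluated separately on reasoning (ARC, GSM8K) and pattern-recognition
    # (HellaSwag, IFEval) benchmarks to measure sensitivity to the ratio.
    print(f"lr/bs ratio = {ratio:.2e} -> {args.output_dir}")
```

Under this setup, the paper's finding corresponds to reasoning scores improving as the ratio grows while pattern-recognition scores peak at the lower end.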
Related papers
- Enhancing Knowledge Distillation for LLMs with Response-Priming Prompting [1.9461727843485295]
We propose a set of novel response-priming prompting strategies to enhance the performance of student models.
Our approach fine-tunes a smaller Llama 3.1 8B Instruct model by distilling knowledge from a quantized Llama 3.1 405B Instruct teacher model.
We find that Ground Truth prompting results in a 55% performance increase on GSM8K for a distilled Llama 3.1 8B Instruct.
arXiv Detail & Related papers (2024-12-18T20:41:44Z) - GRIN: GRadient-INformed MoE [132.87651078514122]
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing.
We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing.
Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z) - Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms are originally designed for the single-turn chat task.
We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z) - Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z) - Mixed Distillation Helps Smaller Language Model Better Reasoning [27.934081882868902]
We introduce the Mixed Distillation (MD) framework, which capitalizes on the strengths of Program of Thought (PoT) and Chain of Thought (CoT) capabilities within large language models (LLMs).
Our experimental results show that MD significantly enhances the single-path and multi-path reasoning ability of smaller models in various tasks.
arXiv Detail & Related papers (2023-12-17T14:28:28Z) - Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks.
We propose a novel pruning approach: first, a training set of architecture-accuracy pairs is established, and then a non-neural model is trained as an accuracy predictor.
arXiv Detail & Related papers (2023-09-18T06:38:24Z) - Scaling Instruction-Finetuned Language Models [126.4789306516927]
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance.
We find that instruction finetuning dramatically improves performance on a variety of model classes.
arXiv Detail & Related papers (2022-10-20T16:58:32Z) - Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems [63.713297451300086]
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system.
arXiv Detail & Related papers (2022-06-15T20:44:23Z) - DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We find that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z)