Related papers: Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

URL: http://arxiv.org/abs/2503.09799v1
Date: Wed, 12 Mar 2025 20:04:38 GMT
Title: Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
Authors: Zachary Charles, Gabriel Teston, Lucio Dery, Keith Rush, Nova Fallen, Zachary Garrett, Arthur Szlam, Arthur Douillard
Abstract summary: We study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes.
Score: 22.7130140114906
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, these works do not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including number of model replicas, hyperparameters, and token budget affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.

Related papers

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z)
Scaling Laws for Upcycling Mixture-of-Experts Language Models [17.796361238003403]
Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches to mitigating such computational demands: reusing smaller models to train larger ones (upcycling) and training computationally efficient models like mixture-of-experts (MoE).
arXiv Detail & Related papers (2025-02-05T09:11:13Z)
A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance [65.48009829137824]
Large-scale 3D parallel training on vision-language instruct-tuning models leads to an imbalanced computation load across different devices. We rebalanced the computational loads from data, model, and memory perspectives to address this issue. Our method's efficacy and generalizability were further demonstrated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z)
AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs [61.13296177652599]
We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales. We propose AutoScale, a two-stage, scale-aware data composition framework.
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models. We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models. We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
More Compute Is What You Need [3.184416958830696]
We propose a new scaling law that suggests model performance depends mostly on the amount of compute spent for transformer-based models. We predict that (a) for inference efficiency, training should prioritize smaller model sizes and larger training datasets, and (b) assuming the exhaustion of available web datasets, scaling the model size might be the only way to further improve model performance.
arXiv Detail & Related papers (2024-04-30T12:05:48Z)
A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization. Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
Scaling Laws for Acoustic Models [7.906034575114518]
Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships. We show that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws.
arXiv Detail & Related papers (2021-06-11T18:59:24Z)
Scaling Laws for Neural Language Models [14.472857826717613]
We study scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training.
arXiv Detail & Related papers (2020-01-23T03:59:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.