Related papers: Data movement limits to frontier model training

Related papers

Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better [24.03797089794804]
We propose a Late-to-Early Training (LET) paradigm that enables Large Language Models to learn later knowledge in earlier steps and earlier layers.<n>We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning.<n>Our method achieves up to 1.6$times$ speedup with nearly 5% improvement in downstream task accuracy compared to standard training.
arXiv Detail & Related papers (2026-02-05T07:19:34Z)
Deep Progressive Training: scaling up depth capacity of zero/one-layer models [19.649807308477527]
We study the depth expansion of large models through the lens of optimization theory.<n>We propose zero/one-layer progressive training for the optimal tradeoff between computation and loss.
arXiv Detail & Related papers (2025-11-07T04:56:45Z)
Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models [0.41942958779358663]
We propose a predictive framework that models training dynamics and helps optimize resource usage.<n>We derive an empirical scaling law based on model size, initial performance, and training progress.<n>We find that training beyond certain number of an epoch offers little gain, suggesting earlier stopping can significantly reduce compute without sacrificing performance.
arXiv Detail & Related papers (2025-07-24T01:09:25Z)
Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training [54.860881625923184]
We show how a critical batch size (CBS) can be estimated based on the gradient noise scale during training.<n>Our findings about how the CBS changes over training motivate batch size warmup, suggesting CBS from small training runs can inform larger-scale training runs.
arXiv Detail & Related papers (2025-05-29T19:53:39Z)
Overtrained Language Models Are Harder to Fine-Tune [64.44743256512237]
Large language models are pre-trained on ever-growing token budgets. We show that extended pre-training can make models harder to fine-tune, leading to degraded final performance.
arXiv Detail & Related papers (2025-03-24T23:11:56Z)
2 OLMo 2 Furious [154.15728448754854]
We present OLMo 2, the next generation of our fully open language models.<n> OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales.<n>We describe our modified model architecture and training recipe.
arXiv Detail & Related papers (2024-12-31T21:55:10Z)
Beyond Next Token Prediction: Patch-Level Training for Large Language Models [69.67438563485887]
We introduce patch-level training for Large Language Models (LLMs)<n>During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.<n>We show that patch-level training can reduce the overall training costs to 0.5$times$, without compromising the model performance.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training. We propose three effective strategies to enhance LLM performance within a fixed compute budget. Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations [62.132347451049455]
Scale has become a main ingredient in obtaining strong machine learning models. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule. We show that weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales.
arXiv Detail & Related papers (2024-05-28T17:33:54Z)
Preparing Lessons for Progressive Training on Language Models [75.88952808979087]
The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions. We propose Apollo, which preptextbfares lessons for extextbfpanding textbfoperations by textbflayer functitextbfonality during training of low layers. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models.
arXiv Detail & Related papers (2024-01-17T13:04:14Z)
An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales. We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training. Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
Early Weight Averaging meets High Learning Rates for LLM Pre-training [20.671831210738937]
We show that models trained with high learning rates observe higher gains due to checkpoint averaging. Our training recipe outperforms conventional training and popular checkpoint averaging baselines.
arXiv Detail & Related papers (2023-06-05T20:51:44Z)
SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training. We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z)
Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude. We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
arXiv Detail & Related papers (2022-11-20T21:48:25Z)
Efficient NLP Model Finetuning via Multistage Data Filtering [11.058786955754004]
We set to filter training examples in a streaming fashion, in tandem with training the target model. Our key techniques are (1) automatically determine a training loss threshold for skipping backward training passes; (2) run a meta predictor for further skipping forward training passes. Our method reduces the required training examples by up to 5.3$times$ and training time by up to 6.8$times$, while only seeing minor accuracy degradation.
arXiv Detail & Related papers (2022-07-28T21:43:31Z)
Staged Training for Transformer Language Models [47.99321376123886]
We consider a staged training setup that begins with a small model and incrementally increases the amount of compute used for training. By initializing each stage with the output of the previous one, the training process effectively re-uses the compute. We empirically validate our growth operators and staged training for autoregressive language models, showing up to 22% compute savings.
arXiv Detail & Related papers (2022-03-11T19:05:42Z)
$\ell_\infty$-Robustness and Beyond: Unleashing Efficient Adversarial Training [11.241749205970253]
We show how selecting a small subset of training data provides a more principled approach towards reducing the time complexity of robust training. Our approach speeds up adversarial training by 2-3 times, while experiencing a small reduction in the clean and robust accuracy.
arXiv Detail & Related papers (2021-12-01T09:55:01Z)
Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training [18.640076155697415]
We present a study of a curriculum learning based approach, which helps improve the pre-training convergence speed of autoregressive models. Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
arXiv Detail & Related papers (2021-08-13T06:32:53Z)
EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models. We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training. EarlyBERT easily achieves comparable performance to standard BERT with 3545% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.