Scaling Law for Language Models Training Considering Batch Size
- URL: http://arxiv.org/abs/2412.01505v1
- Date: Mon, 02 Dec 2024 13:58:35 GMT
- Title: Scaling Law for Language Models Training Considering Batch Size
- Authors: Xian Shuai, Yiding Wang, Yimeng Wu, Xin Jiang, Xiaozhe Ren,
- Abstract summary: Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress.
We empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process.
We establish a basic scaling law on model size and training data amount.
We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models.
- Score: 17.09348741898811
- License:
- Abstract: Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, providing guidance for optimizing LLM training strategies under specific resource constraints.
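The abstract describes fitting a basic scaling law over model size and training-data amount before deriving the batch-size laws. The paper's fitted constants and exact functional form are not given here, so the following is a minimal sketch, assuming a Chinchilla-style parametrization L(N, D) = E + A/N^alpha + B/D^beta and purely synthetic measurements, of how such a law can be estimated and then extrapolated with SciPy.

```python
# Minimal sketch (not the paper's released code) of fitting a Chinchilla-style
# loss surface L(N, D) = E + A / N**alpha + B / D**beta from (model size N,
# training tokens D, final loss L) measurements. All coefficients and
# "measurements" below are illustrative synthetic values, not the paper's fits.
import numpy as np
from scipy.optimize import curve_fit

def loss_law(x, E, A, alpha, B, beta):
    """Parametric loss surface in model parameters N and training tokens D."""
    N, D = x
    return E + A / N**alpha + B / D**beta

# Synthetic observations on a grid of model sizes and token budgets, generated
# from assumed coefficients plus noise so the fit has something to recover.
rng = np.random.default_rng(0)
N_grid = np.array([125e6, 350e6, 760e6, 1.3e9, 2.6e9])  # parameters
D_grid = np.array([50e9, 100e9, 200e9, 300e9])           # training tokens
N_obs, D_obs = (m.ravel() for m in np.meshgrid(N_grid, D_grid))
true_params = (1.7, 400.0, 0.34, 1000.0, 0.28)            # assumed E, A, alpha, B, beta
L_obs = loss_law((N_obs, D_obs), *true_params) + rng.normal(0.0, 0.01, N_obs.size)

# Estimate the five constants from the observations, then extrapolate.
p0 = [2.0, 300.0, 0.3, 800.0, 0.3]
popt, _ = curve_fit(loss_law, (N_obs, D_obs), L_obs, p0=p0, maxfev=50000)
print("fitted (E, A, alpha, B, beta):", np.round(popt, 3))
print("predicted loss at N=7e9, D=1e12:", round(loss_law((7e9, 1e12), *popt), 3))
```

The same fitting-then-extrapolating pattern applies to the batch-size laws described in the abstract, with the batch size (and learning rate) added as extra variables of the fitted surface.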
Related papers
- Scaling Inference-Efficient Language Models [3.271571137474847]
We show that model architecture affects inference latency, where models of the same size can have up to 3.5x difference in latency.
We modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture.
We release the Morph-1B model, which improves inference latency by 1.8x while maintaining accuracy on downstream tasks compared to open-source models.
arXiv Detail & Related papers (2025-01-30T03:16:44Z) - Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families [43.36524246307057]
Scaling laws for large language models (LLMs) predict performance based on parameters like size and training data.
We propose Skills Scaling Laws (SSLaws), a novel scaling law that leverages publicly available benchmark data.
We present both theoretical results on parameter identification and empirical evaluations on 12 prominent benchmarks.
arXiv Detail & Related papers (2024-12-09T14:51:26Z) - Scaling Laws for Post Training Quantized Large Language Models [41.78467383320145]
Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size.
The quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice.
arXiv Detail & Related papers (2024-10-15T23:34:22Z) - LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
By evaluating on different benchmarks with a proper distillation strategy, even a 2.7B small-scale model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z) - Unraveling the Mystery of Scaling Laws: Part I [39.967120253159614]
Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training.
The original scaling law paper by OpenAI did not disclose the complete details necessary to derive the precise scaling law formulas.
We provide step-by-step instructions to estimate all constant terms in scaling-law formulas by training on models with only 1M to 60M parameters.
arXiv Detail & Related papers (2024-03-11T10:05:29Z) - Mixtures of Experts Unlock Parameter Scaling for Deep RL [54.26191237981469]
In this paper, we demonstrate that incorporating Mixture-of-Experts (MoE) modules into value-based networks results in more parameter-scalable models.
This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
arXiv Detail & Related papers (2024-02-13T17:18:56Z) - Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws [14.546425605156578]
We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand (a simple FLOP-accounting sketch of this idea appears after the list below).
We train 47 models of varying sizes and parameter counts to validate our formula and find that model quality continues to improve as we scale tokens per parameter to extreme ranges.
arXiv Detail & Related papers (2023-12-31T10:53:58Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades.
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
To distinguish models by parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
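The "Beyond Chinchilla-Optimal" entry above co-optimizes for training and inference cost. Its exact procedure is not reproduced here; the sketch below only illustrates the underlying FLOP accounting, under the common approximations of roughly 6·N·D FLOPs for training and roughly 2·N FLOPs per generated token at inference, with hypothetical model sizes, token budgets, and serving demand.

```python
# Minimal sketch of inference-aware compute accounting, assuming the common
# approximations of ~6*N*D FLOPs for training and ~2*N FLOPs per generated
# token at inference (N = parameters, D = training tokens). All numbers are
# hypothetical; this is not the paper's actual optimization procedure.

def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Total training + inference FLOPs for one model over its deployment lifetime."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

# Two hypothetical models assumed to reach a similar loss: a larger model trained
# on fewer tokens vs. a smaller model trained on more tokens per parameter.
candidates = {
    "70B params, 1.4T tokens": (70e9, 1.4e12),
    "30B params, 6.0T tokens": (30e9, 6.0e12),
}

for expected_inference_tokens in (1e11, 1e13):  # light vs. heavy serving demand
    print(f"\nExpected inference tokens: {expected_inference_tokens:.0e}")
    for name, (n, d) in candidates.items():
        print(f"  {name}: {lifetime_flops(n, d, expected_inference_tokens):.3e} total FLOPs")
```

With these hypothetical numbers, the larger model is cheaper overall under light serving demand, while the smaller, longer-trained model wins under heavy demand; this is the kind of trade-off the modified scaling laws formalize.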