Scaling Laws for Sparsely-Connected Foundation Models
- URL: http://arxiv.org/abs/2309.08520v1
- Date: Fri, 15 Sep 2023 16:29:27 GMT
- Title: Scaling Laws for Sparsely-Connected Foundation Models
- Authors: Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci
- Abstract summary: We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
- Score: 70.41266138010657
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore the impact of parameter sparsity on the scaling behavior of
Transformers trained on massive datasets (i.e., "foundation models"), in both
vision and language domains. In this setting, we identify the first scaling law
describing the relationship between weight sparsity, number of non-zero
parameters, and amount of training data, which we validate empirically across
model and data scales; on ViT/JFT-4B and T5/C4. These results allow us to
characterize the "optimal sparsity", the sparsity level which yields the best
performance for a given effective model size and training budget. For a fixed
number of non-zero parameters, we identify that the optimal sparsity increases
with the amount of data used for training. We also extend our study to
different sparsity structures (such as the hardware-friendly n:m pattern) and
strategies (such as starting from a pretrained dense model). Our findings shed
light on the power and limitations of weight sparsity across various parameter
and computational settings, offering both theoretical understanding and
practical implications for leveraging sparsity towards computational efficiency
improvements.
Related papers
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language
Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z) - A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z) - Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a.
sparse-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate PST performs on par or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z) - Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight- parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z) - Scaling Laws for Transfer [0.5432984841650929]
We study scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting.
We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size.
arXiv Detail & Related papers (2021-02-02T04:07:38Z) - Understanding the Effects of Data Parallelism and Sparsity on Neural
Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.