Beyond neural scaling laws: beating power law scaling via data pruning
- URL: http://arxiv.org/abs/2206.14486v6
- Date: Fri, 21 Apr 2023 20:14:07 GMT
- Title: Beyond neural scaling laws: beating power law scaling via data pruning
- Authors: Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari S.
Morcos
- Abstract summary: We show how in theory we can break beyond power law scaling and potentially even reduce it to exponential scaling.
We develop a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics.
Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws.
- Score: 37.804100045519846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Widely observed neural scaling laws, in which error falls off as a power of
the training set size, model size, or both, have driven substantial performance
improvements in deep learning. However, these improvements through scaling
alone require considerable costs in compute and energy. Here we focus on the
scaling of error with dataset size and show how in theory we can break beyond
power law scaling and potentially even reduce it to exponential scaling instead
if we have access to a high-quality data pruning metric that ranks the order in
which training examples should be discarded to achieve any pruned dataset size.
We then test this improved scaling prediction with pruned dataset size
empirically, and indeed observe better than power law scaling in practice on
ResNets trained on CIFAR-10, SVHN, and ImageNet. Next, given the importance of
finding high-quality pruning metrics, we perform the first large-scale
benchmarking study of ten different data pruning metrics on ImageNet. We find
most existing high performing metrics scale poorly to ImageNet, while the best
are computationally intensive and require labels for every image. We therefore
developed a new simple, cheap and scalable self-supervised pruning metric that
demonstrates comparable performance to the best supervised metrics. Overall,
our work suggests that the discovery of good data-pruning metrics may provide a
viable path forward to substantially improved neural scaling laws, thereby
reducing the resource costs of modern deep learning.
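As a concrete illustration of the abstract's central ingredient, the sketch below shows a prototype-style self-supervised pruning metric of the kind the paper proposes: cluster self-supervised embeddings with k-means and score each example by its distance to the nearest centroid, then keep either the easy or the hard end of the ranking depending on how much data is available. This is a minimal sketch, not the authors' implementation; the random features standing in for real embeddings, the cluster count, the keep fraction, and the function names are illustrative assumptions.

```python
# Minimal sketch of a prototype-style self-supervised pruning metric:
# cluster self-supervised embeddings with k-means and score each example
# by its distance to the nearest centroid. Which end of the ranking to
# discard depends on the data regime (easy, prototypical examples are
# pruned when data is abundant; hard ones when data is scarce).
# All hyperparameters and names below are illustrative.

import numpy as np
from sklearn.cluster import KMeans


def prototype_prune_scores(embeddings: np.ndarray, n_clusters: int = 100) -> np.ndarray:
    """Per-example 'hardness' score: distance to the nearest k-means centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    # transform() returns distances to every centroid; keep the closest one.
    return km.transform(embeddings).min(axis=1)


def prune_indices(scores: np.ndarray, keep_fraction: float, keep_hard: bool = True) -> np.ndarray:
    """Indices of examples to keep after pruning to `keep_fraction` of the data."""
    n_keep = max(1, int(len(scores) * keep_fraction))
    order = np.argsort(scores)  # easy (close to a prototype) -> hard
    return order[-n_keep:] if keep_hard else order[:n_keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(10_000, 128))  # stand-in for real SSL embeddings
    scores = prototype_prune_scores(feats, n_clusters=50)
    kept = prune_indices(scores, keep_fraction=0.8, keep_hard=True)
    print(f"kept {kept.size} of {feats.shape[0]} examples")
```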
Related papers
- Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches [28.569601803576845]
We show that for models with Transformer architecture, the test loss exhibits a power-law relationship with model size, dataset size, and the amount of computation used in training.
Our analysis provides deeper insights into the scaling law, potentially enhancing our understanding of Large Language Models.
arXiv Detail & Related papers (2025-03-03T08:57:49Z)
- Information-Theoretic Foundations for Neural Scaling Laws [20.617552198581024]
We develop information-theoretic foundations for neural scaling laws.
We observe that the optimal relation between data and model size is linear, up to logarithmic factors.
arXiv Detail & Related papers (2024-06-28T02:20:54Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Effective pruning of web-scale datasets based on complexity of concept clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z)
- Reproducible scaling laws for contrastive language-image learning [42.354402731615444]
We investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository.
Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks.
We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures.
arXiv Detail & Related papers (2022-12-14T10:24:50Z)
- A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
- Scaling Laws for a Multi-Agent Reinforcement Learning Model [0.0]
We study performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero.
We find that player strength scales as a power law in neural network parameter count when not bottlenecked by available compute.
We find that the predicted scaling of optimal neural network size fits our data for both games.
arXiv Detail & Related papers (2022-09-29T19:08:51Z)
- Revisiting Neural Scaling Laws in Language and Vision [43.57394336742374]
We argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting parameters.
We present a recipe for estimating scaling law parameters reliably from learning curves.
We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains.
arXiv Detail & Related papers (2022-09-13T09:41:51Z)
- Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments [42.793379799720434]
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
arXiv Detail & Related papers (2022-02-13T19:13:00Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
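For the Top-KAST entry above, the following is a minimal sketch of top-K magnitude masking, the core operation behind always-sparse training of that kind. It is a simplified illustration under stated assumptions, not the authors' exact procedure (Top-KAST additionally keeps a slightly larger active set for the backward pass and refreshes the mask during training); the function name and the 10% density are ours.

```python
# Simplified sketch of top-K magnitude masking (the core idea behind
# Top-KAST-style always-sparse training). The denser backward set and
# periodic mask refresh used by the actual method are omitted here.

import numpy as np


def topk_mask(weights: np.ndarray, density: float) -> np.ndarray:
    """Boolean mask keeping the largest-magnitude fraction `density` of entries."""
    k = max(1, int(weights.size * density))
    threshold = np.partition(np.abs(weights).ravel(), -k)[-k]
    return np.abs(weights) >= threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256))
    mask = topk_mask(w, density=0.1)   # forward pass would use only these weights
    sparse_w = np.where(mask, w, 0.0)
    print(f"active weights: {mask.mean():.1%}")
```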
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.