Revisiting Neural Scaling Laws in Language and Vision
- URL: http://arxiv.org/abs/2209.06640v1
- Date: Tue, 13 Sep 2022 09:41:51 GMT
- Title: Revisiting Neural Scaling Laws in Language and Vision
- Authors: Ibrahim Alabdulmohsin, Behnam Neyshabur, Xiaohua Zhai
- Abstract summary: We argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting parameters.
We present a recipe for estimating scaling law parameters reliably from learning curves.
We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains.
- Score: 43.57394336742374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable progress in deep learning in recent years is largely driven by
improvements in scale, where bigger models are trained on larger datasets for
longer schedules. To predict the benefit of scale empirically, we argue for a
more rigorous methodology based on the extrapolation loss, instead of reporting
the best-fitting (interpolating) parameters. We then present a recipe for
estimating scaling law parameters reliably from learning curves. We demonstrate
that it extrapolates more accurately than previous methods in a wide range of
architecture families across several domains, including image classification,
neural machine translation (NMT) and language modeling, in addition to tasks
from the BIG-Bench evaluation benchmark. Finally, we release a benchmark
dataset comprising 90 evaluation tasks to facilitate research in this
domain.
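To make the abstract's distinction concrete, here is a minimal sketch of fitting a saturating power law to the small-scale portion of a learning curve and then scoring it by its extrapolation error on held-out larger scales. The functional form, the synthetic data points, and the fit/hold-out split are illustrative assumptions, not the paper's exact recipe or its released benchmark.

```python
# Minimal sketch: fit a saturating power law L(n) = c + a * n**(-b) to the
# small-scale portion of a learning curve, then judge the fit by its
# extrapolation error on the largest scales (held out from fitting).
# Data and functional form are illustrative, not the paper's recipe.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return c + a * n ** (-b)

# Hypothetical learning curve: (training set size, validation loss).
sizes = np.array([1e4, 3e4, 1e5, 3e5, 1e6, 3e6, 1e7])
losses = np.array([3.10, 2.74, 2.41, 2.18, 2.01, 1.90, 1.83])

# Fit on the smaller scales only; hold out the largest scales.
n_fit = 5
popt, _ = curve_fit(power_law, sizes[:n_fit], losses[:n_fit],
                    p0=(10.0, 0.3, 1.0), maxfev=10000)

# Extrapolation loss: error of the fitted curve on the held-out large scales.
pred = power_law(sizes[n_fit:], *popt)
extrapolation_rmse = np.sqrt(np.mean((pred - losses[n_fit:]) ** 2))
print(f"fitted (a, b, c) = {popt}")
print(f"extrapolation RMSE on held-out scales = {extrapolation_rmse:.4f}")
```

Reporting the held-out extrapolation error, rather than how well the curve interpolates the points it was fit on, is the methodological shift the abstract argues for.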
Related papers
- A Bayesian Approach to Data Point Selection [24.98069363998565]
Data point selection (DPS) is becoming a critical topic in deep learning.
Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation.
We propose a novel Bayesian approach to DPS.
arXiv Detail & Related papers (2024-11-06T09:04:13Z)
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
- Scaling Laws For Dense Retrieval [22.76001461620846]
We investigate whether the performance of dense retrieval models follows the same kind of scaling law observed for other neural models.
Results indicate that, under our settings, the performance of dense retrieval models follows a precise power-law scaling related to the model size and the number of annotations.
arXiv Detail & Related papers (2024-03-27T15:27:36Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
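As a rough illustration of the compute-equivalent protocol summarized above, the sketch below converts a shared accelerator-hour budget into a per-model token budget; the throughput figures and model names are hypothetical placeholders, not the Languini baselines' measured numbers.

```python
# Sketch of a compute-equivalent comparison: give every model the same number
# of accelerator hours and derive how many tokens each one gets to process.
# Throughputs and model names are hypothetical placeholders.
accelerator_hours = 6.0  # identical budget for every model

throughput_tokens_per_sec = {
    "feedforward_gpt2_like": 23_000,
    "recurrent_lstm_variant": 2_300,
}

for name, tps in throughput_tokens_per_sec.items():
    tokens = tps * accelerator_hours * 3600
    print(f"{name}: {tokens / 1e9:.2f}B tokens within the shared budget")
```

Letting each model consume whatever number of tokens its own throughput affords within the same wall-clock budget is what makes the comparison compute-equivalent.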
- Learning Large-scale Neural Fields via Context Pruned Meta-Learning [60.93679437452872]
We introduce an efficient optimization-based meta-learning technique for large-scale neural field training.
We show how gradient re-scaling at meta-test time allows the learning of extremely high-quality neural fields.
Our framework is model-agnostic, intuitive, straightforward to implement, and shows significant reconstruction improvements for a wide range of signals.
arXiv Detail & Related papers (2023-02-01T17:32:16Z)
- Leveraging Angular Information Between Feature and Classifier for Long-tailed Learning: A Prediction Reformulation Approach [90.77858044524544]
We reformulate the recognition probabilities through included angles without re-balancing the classifier weights.
Inspired by the performance improvement of the predictive form reformulation, we explore the different properties of this angular prediction.
Our method is able to obtain the best performance among peer methods without pretraining on CIFAR10/100-LT and ImageNet-LT.
arXiv Detail & Related papers (2022-12-03T07:52:48Z)
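The angular reformulation summarized above amounts, in its generic form, to scoring classes by the cosine of the included angle between a feature and each classifier weight, while leaving the trained weights untouched. The sketch below shows only that generic cosine form, with the scale factor as an assumed hyperparameter rather than the paper's exact formulation.

```python
# Sketch of angle-based prediction for a long-tailed classifier: class scores
# come from the cosine of the included angle between the feature and each
# classifier weight, with the trained weights left un-rebalanced.
# The scale value is an assumed hyperparameter, not the paper's.
import numpy as np

def angular_probs(features, weights, scale=16.0):
    # features: (batch, dim); weights: (num_classes, dim) from the trained classifier
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    logits = scale * (f @ w.T)  # cosine of the included angle, rescaled
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 128))
weights = rng.normal(size=(10, 128))  # stand-in for a classifier trained on long-tailed data
print(angular_probs(features, weights).shape)  # (4, 10)
```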
- Sequential Learning Of Neural Networks for Prequential MDL [18.475866691786695]
We evaluate approaches for computing prequential description lengths for image classification datasets with neural networks.
Considering the computational cost, we find that online-learning with rehearsal has favorable performance.
We present description lengths for a suite of image classification datasets that improve upon previously reported results by large margins.
arXiv Detail & Related papers (2022-10-14T16:30:23Z)
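Prequential description length is the cumulative code length of each data block under a model trained only on the blocks that precede it. The sketch below shows that bookkeeping against a placeholder model interface (fit / predict_proba), not the paper's networks, block schedule, or rehearsal strategy.

```python
# Sketch of a prequential description length: code each block of labels with the
# model fitted on all preceding blocks, accumulating -log2 p(y | x) in bits.
# `make_model` is a placeholder interface, not the paper's training setup.
import numpy as np

def prequential_description_length(blocks, make_model, num_classes):
    """blocks: list of (x, y) arrays with integer labels; make_model() returns an
    object exposing fit(x, y) and predict_proba(x) -> (n, num_classes)."""
    total_bits = 0.0
    seen_x, seen_y = [], []
    for x, y in blocks:
        if seen_x:
            model = make_model()
            model.fit(np.concatenate(seen_x), np.concatenate(seen_y))
            p = model.predict_proba(x)[np.arange(len(y)), y]
            total_bits += float(-np.log2(np.clip(p, 1e-12, 1.0)).sum())
        else:
            # no trained model yet: code the first block under a uniform prior
            total_bits += len(y) * np.log2(num_classes)
        seen_x.append(x)
        seen_y.append(y)
    return total_bits
```

An online-learning or rehearsal variant would update the model incrementally instead of refitting from scratch at every block, which is the computational trade-off the summary mentions.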
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
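A nearest-neighbor language model of this kind interpolates the parametric next-token distribution with one built from retrieved datastore entries. The sketch below shows that interpolation with made-up keys, values, and mixing weight; it does not reproduce the paper's efficiency techniques for shrinking the datastore and speeding up retrieval.

```python
# Sketch of kNN-LM style interpolation: mix the parametric LM distribution with
# a distribution built from the retrieved neighbors' next tokens. The datastore,
# lambda, and k below are illustrative, not the paper's efficiency recipe.
import numpy as np

def knn_lm_probs(p_lm, query, keys, values, vocab_size, k=4, lam=0.25, temp=1.0):
    # keys: (n, dim) context vectors; values: (n,) token that followed each key
    d = np.linalg.norm(keys - query, axis=1)   # L2 distance to every key
    nearest = np.argsort(d)[:k]
    w = np.exp(-d[nearest] / temp)
    w /= w.sum()                               # softmax over negative distances
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], w)       # mass on neighbors' next tokens
    return lam * p_knn + (1 - lam) * p_lm      # interpolated distribution

# Tiny usage example with random stand-ins for real LM states.
rng = np.random.default_rng(0)
vocab_size, dim, n = 50, 8, 200
p_lm = rng.dirichlet(np.ones(vocab_size))
keys = rng.normal(size=(n, dim))
values = rng.integers(0, vocab_size, size=n)
query = rng.normal(size=dim)
print(knn_lm_probs(p_lm, query, keys, values, vocab_size).sum())  # ~1.0
```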
- Few-shot Learning for Spatial Regression [31.022722103424684]
We propose a few-shot learning method for spatial regression.
Our model is trained using spatial datasets on various attributes in various regions.
In our experiments, we demonstrate that the proposed method achieves better predictive performance than existing meta-learning methods.
arXiv Detail & Related papers (2020-10-09T04:05:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.