Scaling Laws for Deep Learning
- URL: http://arxiv.org/abs/2108.07686v1
- Date: Tue, 17 Aug 2021 15:37:05 GMT
- Title: Scaling Laws for Deep Learning
- Authors: Jonathan S. Rosenfeld
- Abstract summary: In this thesis we take a systematic approach to address the algorithmic and methodological limitations at the root of these costs.
We first demonstrate that deep learning training and pruning are predictable and governed by scaling laws.
We then show through the exploration of a noiseless realizable case that DL is in fact dominated by error sources very far from the lower error limit.
- Score: 1.90365714903665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Running faster will only get you so far -- it is generally advisable to first
understand where the roads lead, then get a car ...
The renaissance of machine learning (ML) and deep learning (DL) over the last
decade is accompanied by an unscalable computational cost, limiting its
advancement and weighing on the field in practice. In this thesis we take a
systematic approach to address the algorithmic and methodological limitations
at the root of these costs. We first demonstrate that DL training and pruning
are predictable and governed by scaling laws -- for state-of-the-art models and
tasks, spanning image classification and language modeling, as well as for
state-of-the-art model compression via iterative pruning. Predictability, via
the establishment of these scaling laws, provides the path for principled
design and trade-off reasoning, currently largely lacking in the field. We then
continue to analyze the sources of the scaling laws, offering an
approximation-theoretic view and showing through the exploration of a noiseless
realizable case that DL is in fact dominated by error sources very far from the
lower error limit. We conclude by building on the gained theoretical
understanding of the scaling laws' origins. We present a conjectural path to
eliminate one of the current dominant error sources -- through a data bandwidth
limiting hypothesis and the introduction of Nyquist learners -- which can, in
principle, reach the generalization error lower limit (e.g. 0 in the noiseless
case), at finite dataset size.
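As a concrete illustration of the trade-off reasoning such scaling laws enable, the sketch below fits a saturating power law of the form err(n) ~ err_inf + a * n^(-alpha), a common parameterization in the scaling-laws literature, to a few (dataset size, error) measurements and extrapolates to a larger dataset. The functional form and the data points here are illustrative assumptions, not the thesis's exact parameterization or results.

```python
# Minimal sketch (not the thesis code): fit a saturating power-law scaling curve
# err(n) ~= err_inf + a * n**(-alpha) to observed (dataset size, error) pairs
# and use it to extrapolate the error at a larger, not-yet-trained dataset size.
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(n, a, alpha, err_inf):
    """Generalization error as a function of dataset size n."""
    return err_inf + a * np.power(n, -alpha)

# Illustrative (made-up) measurements: dataset sizes and validation errors.
n_obs = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
err_obs = np.array([0.42, 0.31, 0.23, 0.18, 0.15])

# Fit the three parameters; bounds keep them in a physically sensible range.
(a, alpha, err_inf), _ = curve_fit(
    saturating_power_law, n_obs, err_obs,
    p0=(1.0, 0.3, 0.1), bounds=(0.0, [np.inf, 2.0, 1.0]),
)

# Extrapolate: predicted error if the dataset were scaled to 10M examples.
print(f"alpha={alpha:.3f}, err_inf={err_inf:.3f}, "
      f"predicted err at n=1e7: {saturating_power_law(1e7, a, alpha, err_inf):.3f}")
```

With fits of this kind over both dataset and model size, one can estimate how much additional data or capacity a target error level would require before committing compute.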
Related papers
- Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
- Knowledge-Aware Parsimony Learning: A Perspective from Relational Graphs [47.6830995661091]
We develop next-generation models in a parsimonious manner, achieving greater potential with simpler models.
The key is to drive models using domain-specific knowledge, such as symbols, logic, and formulas, instead of relying on the scaling law.
This approach allows us to build a framework that uses this knowledge as "building blocks" to achieve parsimony in model design, training, and interpretation.
arXiv Detail & Related papers (2024-06-29T15:52:37Z)
- Information-Theoretic Foundations for Neural Scaling Laws [20.617552198581024]
We develop information-theoretic foundations for neural scaling laws.
We observe that the optimal relation between data and model size is linear, up to logarithmic factors.
arXiv Detail & Related papers (2024-06-28T02:20:54Z)
- Selecting Large Language Model to Fine-tune via Rectified Scaling Law [74.84096546112215]
Given constrained resources, fine-tuning all models and making selections afterward is unrealistic.
We find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase".
By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption.
arXiv Detail & Related papers (2024-02-04T01:55:00Z)
- Predicting Emergent Abilities with Infinite Resolution Evaluation [85.89911520190711]
We introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase.
We predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts.
We identify a kind of accelerated emergence whose scaling curve cannot be fitted by a standard scaling-law function.
arXiv Detail & Related papers (2023-10-05T02:35:00Z)
- Reproducible scaling laws for contrastive language-image learning [42.354402731615444]
We investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository.
Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks.
We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures.
arXiv Detail & Related papers (2022-12-14T10:24:50Z)
- A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which the power laws occurring in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
- Scaling Laws Beyond Backpropagation [64.0476282000118]
We study the ability of Direct Feedback Alignment (DFA) to train causal decoder-only Transformers efficiently.
We find that DFA fails to offer more efficient scaling than backpropagation.
arXiv Detail & Related papers (2022-10-26T10:09:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.