Related papers: Scaling Laws for Deep Learning

Scaling Laws for Deep Learning

URL: http://arxiv.org/abs/2108.07686v1
Date: Tue, 17 Aug 2021 15:37:05 GMT
Title: Scaling Laws for Deep Learning
Authors: Jonathan S. Rosenfeld
Abstract summary: In this thesis we take a systematic approach to address the algorithmic and methodological limitations at the root of deep learning's unscalable computational cost. We first demonstrate that deep learning training and pruning are predictable and governed by scaling laws. We then show, through the exploration of a noiseless realizable case, that deep learning is in fact dominated by error sources very far from the lower error limit.
Score: 1.90365714903665
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Running faster will only get you so far -- it is generally advisable to first understand where the roads lead, then get a car ... The renaissance of machine learning (ML) and deep learning (DL) over the last decade is accompanied by an unscalable computational cost, limiting its advancement and weighing on the field in practice. In this thesis we take a systematic approach to address the algorithmic and methodological limitations at the root of these costs. We first demonstrate that DL training and pruning are predictable and governed by scaling laws -- for state of the art models and tasks, spanning image classification and language modeling, as well as for state of the art model compression via iterative pruning. Predictability, via the establishment of these scaling laws, provides the path for principled design and trade-off reasoning, currently largely lacking in the field. We then continue to analyze the sources of the scaling laws, offering an approximation-theoretic view and showing through the exploration of a noiseless realizable case that DL is in fact dominated by error sources very far from the lower error limit. We conclude by building on the gained theoretical understanding of the scaling laws' origins. We present a conjectural path to eliminate one of the current dominant error sources -- through a data bandwidth limiting hypothesis and the introduction of Nyquist learners -- which can, in principle, reach the generalization error lower limit (e.g. 0 in the noiseless case), at finite dataset size.

Related papers

Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning [89.17086632436363]
We introduce a synthetic multi-hop reasoning environment designed to replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling that linearly maps the knowledge graph search entropy to the optimal model size.
arXiv Detail & Related papers (2025-04-04T17:57:22Z)
Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches [28.569601803576845]
We show that for models with Transformer architecture, the test loss exhibits a power-law relationship with model size, dataset size, and the amount of computation used in training. Our analysis provides deeper insights into the scaling law, potentially enhancing our understanding of Large Language Models.
arXiv Detail & Related papers (2025-03-03T08:57:49Z)
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
Has LLM Reached the Scaling Ceiling Yet? Unified Insights into LLM Regularities and Constraints [0.0]
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their scalability raises a critical question: Have we reached the scaling ceiling? This paper develops a unified theoretical framework that integrates mathematical and statistical insights to explain the scaling dynamics of LLMs. Future progress will require a shift from brute-force scaling to innovations in architecture, data quality, and training paradigms.
arXiv Detail & Related papers (2024-12-21T02:19:07Z)
Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates. We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
Knowledge-Aware Parsimony Learning: A Perspective from Relational Graphs [47.6830995661091]
We develop next-generation models in a parsimonious manner, achieving greater potential with simpler models. The key is to drive models using domain-specific knowledge, such as symbols, logic, and formulas, instead of relying on the scaling law. This approach allows us to build a framework that uses this knowledge as "building blocks" to achieve parsimony in model design, training, and interpretation.
arXiv Detail & Related papers (2024-06-29T15:52:37Z)
Information-Theoretic Foundations for Neural Scaling Laws [20.617552198581024]
We develop information-theoretic foundations for neural scaling laws. We observe that the optimal relation between data and model size is linear, up to logarithmic factors.
arXiv Detail & Related papers (2024-06-28T02:20:54Z)
Selecting Large Language Model to Fine-tune via Rectified Scaling Law [74.84096546112215]
Given constrained resources, fine-tuning all models and making selections afterward is unrealistic. We find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase". By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption.
arXiv Detail & Related papers (2024-02-04T01:55:00Z)
Predicting Emergent Abilities with Infinite Resolution Evaluation [85.89911520190711]
We introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase. We predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts. We identify a kind of accelerated emergence whose scaling curve cannot be fitted by a standard scaling law function.
arXiv Detail & Related papers (2023-10-05T02:35:00Z)
Reproducible scaling laws for contrastive language-image learning [42.354402731615444]
We investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures.
arXiv Detail & Related papers (2022-12-14T10:24:50Z)
A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws. We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
Scaling Laws Beyond Backpropagation [64.0476282000118]
We study the ability of Direct Feedback Alignment to train causal decoder-only Transformers efficiently. We find that DFA fails to offer more efficient scaling than backpropagation.
arXiv Detail & Related papers (2022-10-26T10:09:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.