Unified Neural Network Scaling Laws and Scale-time Equivalence
- URL: http://arxiv.org/abs/2409.05782v1
- Date: Mon, 9 Sep 2024 16:45:26 GMT
- Title: Unified Neural Network Scaling Laws and Scale-time Equivalence
- Authors: Akhilan Boopathy, Ila Fiete
- Abstract summary: We present a novel theoretical characterization of how three factors -- model size, training time, and data volume -- interact to determine the performance of deep neural networks.
We first establish a theoretical and empirical equivalence between scaling the size of a neural network and increasing its training time proportionally.
We then combine scale-time equivalence with a linear model analysis of double descent to obtain a unified theoretical scaling law.
- Score: 10.918504301310753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As neural networks continue to grow in size but datasets might not, it is vital to understand how much performance improvement can be expected: is it more important to scale network size or data volume? Thus, neural network scaling laws, which characterize how test error varies with network size and data volume, have become increasingly important. However, existing scaling laws are often applicable only in limited regimes and often do not incorporate or predict well-known phenomena such as double descent. Here, we present a novel theoretical characterization of how three factors -- model size, training time, and data volume -- interact to determine the performance of deep neural networks. We first establish a theoretical and empirical equivalence between scaling the size of a neural network and increasing its training time proportionally. Scale-time equivalence challenges the current practice, wherein large models are trained for small durations, and suggests that smaller models trained over extended periods could match their efficacy. It also leads to a novel method for predicting the performance of large-scale networks from small-scale networks trained for extended epochs, and vice versa. We next combine scale-time equivalence with a linear model analysis of double descent to obtain a unified theoretical scaling law, which we confirm with experiments across vision benchmarks and network architectures. These laws explain several previously unexplained phenomena: reduced data requirements for generalization in larger models, heightened sensitivity to label noise in overparameterized models, and instances where increasing model scale does not necessarily enhance performance. Our findings hold significant implications for the practical deployment of neural networks, offering a more accessible and efficient path to training and fine-tuning large models.
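To make the scale-time equivalence concrete, the following is a minimal illustrative sketch (not the authors' code) of how one might estimate a larger model's test error from a smaller model's training curve. The function name and the made-up numbers are hypothetical; the only assumption carried over from the abstract is that a model scaled up by a factor k and trained for T epochs behaves roughly like the base model trained for k*T epochs, up to constants that would have to be fit empirically.

```python
# Hypothetical sketch (not the paper's implementation): using scale-time
# equivalence to estimate a large model's test error from a small model's
# measured training curve. Assumption: a k-times-larger model trained for
# T epochs roughly matches the base model trained for k*T epochs.

import numpy as np

def predict_large_model_error(small_model_errors, epochs, scale_factor, target_epochs):
    """Estimate the test error of a model `scale_factor` times larger,
    trained for `target_epochs`, by reading the small model's curve at
    scale_factor * target_epochs (interpolated in log-epoch space).

    small_model_errors: test errors of the small model, one per entry of `epochs`
    epochs: epoch counts at which the small model was evaluated (increasing)
    """
    equivalent_epochs = scale_factor * target_epochs
    return np.interp(np.log(equivalent_epochs), np.log(epochs), small_model_errors)

# Example usage with made-up numbers:
epochs = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256])
small_errors = np.array([0.60, 0.52, 0.45, 0.40, 0.36, 0.33, 0.31, 0.30, 0.295])
est = predict_large_model_error(small_errors, epochs, scale_factor=8, target_epochs=16)
print(f"Estimated test error of an 8x larger model after 16 epochs: {est:.3f}")
```

The same lookup can be run in reverse: given a large model's short training run, divide rather than multiply the epoch count to anticipate how a smaller model would fare after much longer training.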
Related papers
- Unlocking the Theory Behind Scaling 1-Bit Neural Networks [7.578147116161996]
1-bit Large Language Models (LLMs) have emerged, showcasing an impressive combination of efficiency and performance that rivals traditional LLMs.
We present the first theoretical result that rigorously establishes a Scaling Law for 1-bit Neural Networks.
Our findings underscore the promising potential of scaling 1-bit neural networks, suggesting that int1 could become the standard in future neural network precision.
arXiv Detail & Related papers (2024-11-03T19:18:57Z) - Strong Model Collapse [16.071600606637908]
We consider a supervised regression setting and establish the existence of a strong form of the model collapse phenomenon.
Our results show that even the smallest fraction of synthetic data can lead to model collapse.
We investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse.
arXiv Detail & Related papers (2024-10-07T08:54:23Z) - Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z) - Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers [11.408339220607251]
Empirical science of neural scaling laws is a rapidly growing area of significant importance to the future of machine learning.
Our main goal is to investigate how the amount of pre-training data affects the few-shot generalization performance of standard image classifiers.
arXiv Detail & Related papers (2021-10-13T19:07:01Z) - Mitigating Performance Saturation in Neural Marked Point Processes: Architectures and Loss Functions [50.674773358075015]
We propose a simple graph-based network structure called GCHP, which utilizes only graph convolutional layers.
We show that GCHP can significantly reduce training time, and that a likelihood ratio loss with interarrival time probability assumptions can greatly improve model performance.
arXiv Detail & Related papers (2021-07-07T16:59:14Z) - On Robustness and Transferability of Convolutional Neural Networks [147.71743081671508]
Modern deep convolutional neural networks (CNNs) are often criticized for not generalizing under distributional shifts.
We study the interplay between out-of-distribution and transfer performance of modern image classification CNNs for the first time.
We find that increasing both training set size and model size significantly improves robustness to distributional shift.
arXiv Detail & Related papers (2020-07-16T18:39:04Z) - The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z) - Scaling Laws for Neural Language Models [14.472857826717613]
We study scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training.
arXiv Detail & Related papers (2020-01-23T03:59:20Z)
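For reference, the power-law relationships described in the last entry can be written schematically as below. This is a sketch of the functional forms only; the constants and exponents are fit empirically in that paper and are left symbolic here.

```latex
% Schematic power-law forms of the kind described in
% "Scaling Laws for Neural Language Models"; N_c, D_c, C_c and the
% exponents \alpha_N, \alpha_D, \alpha_C are empirically fitted constants.
\begin{align}
  L(N) &\approx \left(\frac{N_c}{N}\right)^{\alpha_N} && \text{(model size $N$, data not limiting)} \\
  L(D) &\approx \left(\frac{D_c}{D}\right)^{\alpha_D} && \text{(dataset size $D$, model not limiting)} \\
  L(C) &\approx \left(\frac{C_c}{C}\right)^{\alpha_C} && \text{(training compute $C$)}
\end{align}
```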