On the Diminishing Returns of Width for Continual Learning
- URL: http://arxiv.org/abs/2403.06398v3
- Date: Tue, 18 Jun 2024 21:22:10 GMT
- Title: On the Diminishing Returns of Width for Continual Learning
- Authors: Etash Guha, Vihan Lakshman,
- Abstract summary: We analyze Continual Learning Theory to prove that width is directly related to forgetting in Feed-Forward Networks (FFN)
Specifically, we demonstrate that increasing network widths to reduce forgetting yields diminishing returns.
- Score: 2.9301925522760524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While deep neural networks have demonstrated groundbreaking performance in various settings, these models often suffer from \emph{catastrophic forgetting} when trained on new tasks in sequence. Several works have empirically demonstrated that increasing the width of a neural network leads to a decrease in catastrophic forgetting but have yet to characterize the exact relationship between width and continual learning. We design one of the first frameworks to analyze Continual Learning Theory and prove that width is directly related to forgetting in Feed-Forward Networks (FFN). Specifically, we demonstrate that increasing network widths to reduce forgetting yields diminishing returns. We empirically verify our claims at widths hitherto unexplored in prior studies where the diminishing returns are clearly observed as predicted by our theory.
Related papers
- Why do Learning Rates Transfer? Reconciling Optimization and Scaling
Limits for Deep Learning [77.82908213345864]
We find empirical evidence that learning rate transfer can be attributed to the fact that under $mu$P and its depth extension, the largest eigenvalue of the training loss Hessian is largely independent of the width and depth of the network.
We show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer.
arXiv Detail & Related papers (2024-02-27T12:28:01Z) - Network Degeneracy as an Indicator of Training Performance: Comparing
Finite and Infinite Width Angle Predictions [3.04585143845864]
We show that as networks get deeper and deeper, they are more susceptible to becoming degenerate.
We use a simple algorithm that can accurately predict the level of degeneracy for any given fully connected ReLU network architecture.
arXiv Detail & Related papers (2023-06-02T13:02:52Z) - Feature-Learning Networks Are Consistent Across Widths At Realistic
Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data have not only identical loss curves but also agree in their point-wise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z) - Dynamics of Finite Width Kernel and Prediction Fluctuations in Mean
Field Neural Networks [47.73646927060476]
We analyze the dynamics of finite width effects in wide but finite feature learning neural networks.
Our results are non-perturbative in the strength of feature learning.
arXiv Detail & Related papers (2023-04-06T23:11:49Z) - Slimmable Networks for Contrastive Self-supervised Learning [67.21528544724546]
Self-supervised learning makes significant progress in pre-training large models, but struggles with small models.
We present a one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - Wide Neural Networks Forget Less Catastrophically [39.907197907411266]
We study the impact of "width" of the neural network architecture on catastrophic forgetting.
We study the learning dynamics of the network from various perspectives.
arXiv Detail & Related papers (2021-10-21T23:49:23Z) - The Limitations of Large Width in Neural Networks: A Deep Gaussian
Process Perspective [34.67386186205545]
This paper decouples capacity and width via the generalization of neural networks to Deep Gaussian Processes (Deep GP)
Surprisingly, we prove that even nonparametric Deep GP converges to Gaussian processes, effectively becoming shallower without any increase in representational power.
We find there is a "sweet spot" that maximizes test set performance before the limiting GP behavior prevents adaptability, occurring at width = 1 or width = 2 for nonparametric Deep GP.
arXiv Detail & Related papers (2021-06-11T17:58:58Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Asymptotics of Wide Convolutional Neural Networks [18.198962344790377]
We study scaling laws for wide CNNs and networks with skip connections.
We find that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width.
arXiv Detail & Related papers (2020-08-19T21:22:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.