Related papers: On the Diminishing Returns of Width for Continual Learning

URL: http://arxiv.org/abs/2403.06398v3
Date: Tue, 18 Jun 2024 21:22:10 GMT
Title: On the Diminishing Returns of Width for Continual Learning
Authors: Etash Guha, Vihan Lakshman
Abstract summary: We analyze Continual Learning Theory to prove that width is directly related to forgetting in Feed-Forward Networks (FFN). Specifically, we demonstrate that increasing network widths to reduce forgetting yields diminishing returns.
Score: 2.9301925522760524
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While deep neural networks have demonstrated groundbreaking performance in various settings, these models often suffer from catastrophic forgetting when trained on new tasks in sequence. Several works have empirically demonstrated that increasing the width of a neural network leads to a decrease in catastrophic forgetting but have yet to characterize the exact relationship between width and continual learning. We design one of the first frameworks to analyze Continual Learning Theory and prove that width is directly related to forgetting in Feed-Forward Networks (FFN). Specifically, we demonstrate that increasing network widths to reduce forgetting yields diminishing returns. We empirically verify our claims at widths hitherto unexplored in prior studies where the diminishing returns are clearly observed as predicted by our theory.

Related papers

Make Haste Slowly: A Theory of Emergent Structured Mixed Selectivity in Feature Learning ReLU Networks [16.83151955540625]
We take a step towards a theory of feature learning in finite ReLU networks. We show how structured mixed-selective latent representations can emerge due to a bias for node-reuse and learning speed.
arXiv Detail & Related papers (2025-03-08T11:47:33Z)
Network Degeneracy as an Indicator of Training Performance: Comparing Finite and Infinite Width Angle Predictions [3.04585143845864]
We show that as networks get deeper and deeper, they are more susceptible to becoming degenerate. We use a simple algorithm that can accurately predict the level of degeneracy for any given fully connected ReLU network architecture.
arXiv Detail & Related papers (2023-06-02T13:02:52Z)
Feature-Learning Networks Are Consistent Across Widths At Realistic Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets. Early in training, wide neural networks trained on online data have not only identical loss curves but also agree in their point-wise test predictions throughout training. We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z)
Dynamics of Finite Width Kernel and Prediction Fluctuations in Mean Field Neural Networks [47.73646927060476]
We analyze the dynamics of finite width effects in wide but finite feature learning neural networks. Our results are non-perturbative in the strength of feature learning.
arXiv Detail & Related papers (2023-04-06T23:11:49Z)
Wide Neural Networks Forget Less Catastrophically [39.907197907411266]
We study the impact of "width" of the neural network architecture on catastrophic forgetting. We study the learning dynamics of the network from various perspectives.
arXiv Detail & Related papers (2021-10-21T23:49:23Z)
The Limitations of Large Width in Neural Networks: A Deep Gaussian Process Perspective [34.67386186205545]
This paper decouples capacity and width via the generalization of neural networks to Deep Gaussian Processes (Deep GP). Surprisingly, we prove that even nonparametric Deep GP converges to Gaussian processes, effectively becoming shallower without any increase in representational power. We find there is a "sweet spot" that maximizes test set performance before the limiting GP behavior prevents adaptability, occurring at width = 1 or width = 2 for nonparametric Deep GP.
arXiv Detail & Related papers (2021-06-11T17:58:58Z)
Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks. We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z)
A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time. We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both. Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z)
Asymptotics of Wide Convolutional Neural Networks [18.198962344790377]
We study scaling laws for wide CNNs and networks with skip connections. We find that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width.
arXiv Detail & Related papers (2020-08-19T21:22:19Z)
Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks. Centered and ensembled finite networks have reduced posterior variance. Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings. We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.