Related papers: Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning

Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning

URL: http://arxiv.org/abs/2406.06158v2
Date: Sat, 12 Oct 2024 21:38:28 GMT
Title: Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning
Authors: Daniel Kunin, Allan Raventós, Clémentine Dominé, Feng Chen, David Klindt, Andrew Saxe, Surya Ganguli,
Abstract summary: We study how unbalanced layer-specific initialization variances and learning rates determine the degree of feature learning. Our analysis reveals that they conspire to influence the learning regime through a set of conserved quantities. We provide evidence that this unbalanced rich regime drives feature learning in deep finite-width networks, promotes interpretability of early layers in CNNs, reduces the sample complexity of learning hierarchical data, and decreases the time to grokking in modular arithmetic.
Score: 26.07501953088188
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While the impressive performance of modern neural networks is often attributed to their capacity to efficiently extract task-relevant features from data, the mechanisms underlying this rich feature learning regime remain elusive, with much of our theoretical understanding stemming from the opposing lazy regime. In this work, we derive exact solutions to a minimal model that transitions between lazy and rich learning, precisely elucidating how unbalanced layer-specific initialization variances and learning rates determine the degree of feature learning. Our analysis reveals that they conspire to influence the learning regime through a set of conserved quantities that constrain and modify the geometry of learning trajectories in parameter and function space. We extend our analysis to more complex linear models with multiple neurons, outputs, and layers and to shallow nonlinear networks with piecewise linear activation functions. In linear networks, rapid feature learning only occurs from balanced initializations, where all layers learn at similar speeds. While in nonlinear networks, unbalanced initializations that promote faster learning in earlier layers can accelerate rich learning. Through a series of experiments, we provide evidence that this unbalanced rich regime drives feature learning in deep finite-width networks, promotes interpretability of early layers in CNNs, reduces the sample complexity of learning hierarchical data, and decreases the time to grokking in modular arithmetic. Our theory motivates further exploration of unbalanced initializations to enhance efficient feature learning.

Related papers

Dichotomy of Feature Learning and Unlearning: Fast-Slow Analysis on Neural Networks with Stochastic Gradient Descent [5.5165579223151795]
In this study, we consider the infinite-width limit of a two-layer neural network updated with a large-batch gradient, then derive differential equations with different time scales.<n>The direction of a flow on a critical manifold, determined by the slow dynamics, decides whether feature unlearning occurs.<n>Our results yield the following insights: (i) the strength of the primary nonlinear term in data induces the feature unlearning, and (ii) an initial scale of the second-layer weights mitigates the feature unlearning.
arXiv Detail & Related papers (2026-02-07T05:50:44Z)
The Importance of Being Lazy: Scaling Limits of Continual Learning [60.97756735877614]
We show that increasing model width is only beneficial when it reduces the amount of feature learning, yielding more laziness.<n>We study the intricate relationship between feature learning, task non-stationarity, and forgetting, finding that high feature learning is only beneficial with highly similar tasks.
arXiv Detail & Related papers (2025-06-20T10:12:38Z)
Make Haste Slowly: A Theory of Emergent Structured Mixed Selectivity in Feature Learning ReLU Networks [16.83151955540625]
We take a step towards a theory of feature learning in finite ReLU networks. We show how structured mixed-selective latent representations can emerge due to a bias for node-reuse and learning speed.
arXiv Detail & Related papers (2025-03-08T11:47:33Z)
From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks [47.13391046553908]
In artificial networks, the effectiveness of these models relies on their ability to build task specific representation. Prior studies highlight that different initializations can place networks in either a lazy regime, where representations remain static, or a rich/feature learning regime, where representations evolve dynamically. These solutions capture the evolution of representations and the Neural Kernel across the spectrum from the rich to the lazy regimes.
arXiv Detail & Related papers (2024-09-22T23:19:04Z)
A Unified Framework for Neural Computation and Learning Over Time [56.44910327178975]
Hamiltonian Learning is a novel unified framework for learning with neural networks "over time" It is based on differential equations that: (i) can be integrated without the need of external software solvers; (ii) generalize the well-established notion of gradient-based learning in feed-forward and recurrent networks; (iii) open to novel perspectives.
arXiv Detail & Related papers (2024-09-18T14:57:13Z)
Dynamics of Supervised and Reinforcement Learning in the Non-Linear Perceptron [3.069335774032178]
We use a dataset-process approach to derive flow equations describing learning. We characterize the effects of the learning rule (supervised or reinforcement learning, SL/RL) and input-data distribution on the perceptron's learning curve. This approach points a way toward analyzing learning dynamics for more-complex circuit architectures.
arXiv Detail & Related papers (2024-09-05T17:58:28Z)
How connectivity structure shapes rich and lazy learning in neural circuits [14.236853424595333]
We investigate how the structure of the initial weights -- in particular their effective rank -- influences the network learning regime. Our research highlights the pivotal role of initial weight structures in shaping learning regimes.
arXiv Detail & Related papers (2023-10-12T17:08:45Z)
Critical Learning Periods for Multisensory Integration in Deep Networks [112.40005682521638]
We show that the ability of a neural network to integrate information from diverse sources hinges critically on being exposed to properly correlated signals during the early phases of training. We show that critical periods arise from the complex and unstable early transient dynamics, which are decisive of final performance of the trained system and their learned representations.
arXiv Detail & Related papers (2022-10-06T23:50:38Z)
The Influence of Learning Rule on Representation Dynamics in Wide Neural Networks [18.27510863075184]
We analyze infinite-width deep gradient networks trained with feedback alignment (FA), direct feedback alignment (DFA), and error modulated Hebbian learning (Hebb) We show that, for each of these learning rules, the evolution of the output function at infinite width is governed by a time varying effective neural tangent kernel (eNTK) In the lazy training limit, this eNTK is static and does not evolve, while in the rich mean-field regime this kernel's evolution can be determined self-consistently with dynamical mean field theory (DMFT)
arXiv Detail & Related papers (2022-10-05T11:33:40Z)
Learning Fast and Slow for Online Time Series Forecasting [76.50127663309604]
Fast and Slow learning Networks (FSNet) is a holistic framework for online time-series forecasting. FSNet balances fast adaptation to recent changes and retrieving similar old knowledge. Our code will be made publicly available.
arXiv Detail & Related papers (2022-02-23T18:23:07Z)
Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs. By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks [43.860358308049044]
In work, we show that these common perceptions can be completely false in the early phase of learning. We argue that this surprising simplicity can persist in networks with more layers with convolutional architecture.
arXiv Detail & Related papers (2020-06-25T17:42:49Z)
The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings. We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.