A Theory of Initialisation's Impact on Specialisation
- URL: http://arxiv.org/abs/2503.02526v1
- Date: Tue, 04 Mar 2025 11:39:30 GMT
- Title: A Theory of Initialisation's Impact on Specialisation
- Authors: Devon Jarvis, Sebastian Lee, Clémentine Carla Juliette Dominé, Andrew M Saxe, Stefano Sarao Mannelli
- Abstract summary: We show that weight imbalance and high weight entropy can favour specialised solutions. We then show the emergence of a monotonic relation between task-similarity and forgetting in non-specialised networks.
- Score: 13.486658531315213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work has demonstrated a consistent tendency in neural networks engaged in continual learning tasks, wherein intermediate task similarity results in the highest levels of catastrophic interference. This phenomenon is attributed to the network's tendency to reuse learned features across tasks. However, this explanation heavily relies on the premise that neuron specialisation occurs, i.e. the emergence of localised representations. Our investigation challenges the validity of this assumption. Using theoretical frameworks for the analysis of neural networks, we show a strong dependence of specialisation on the initial condition. More precisely, we show that weight imbalance and high weight entropy can favour specialised solutions. We then apply these insights in the context of continual learning, first showing the emergence of a monotonic relation between task-similarity and forgetting in non-specialised networks. Finally, we show that specialisation induced by weight imbalance is beneficial when combined with the commonly employed elastic weight consolidation regularisation technique.
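To make the two ingredients in the abstract concrete, here is a minimal NumPy sketch (illustrative, not taken from the paper) of an imbalanced initialisation for a two-layer linear network, together with the standard elastic weight consolidation penalty. The scale factor `alpha` and the Gram-matrix imbalance measure are assumptions introduced here for illustration, not quantities defined in the abstract.

```python
import numpy as np

# Sketch of an "imbalanced" initialisation of a two-layer linear network.
# Rescaling the layers in opposite directions changes the balance between
# them while leaving the end-to-end map W2 @ W1 unchanged.
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4
W1 = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_hidden, d_in))
W2 = rng.normal(scale=1.0 / np.sqrt(d_hidden), size=(d_out, d_hidden))

alpha = 4.0  # illustrative imbalance knob: alpha > 1 loads the first layer
W1_imb, W2_imb = alpha * W1, W2 / alpha  # W2_imb @ W1_imb == W2 @ W1

# A common balance measure for linear networks: for a balanced pair,
# W1 @ W1.T - W2.T @ W2 is (near) zero.
imbalance = np.linalg.norm(W1_imb @ W1_imb.T - W2_imb.T @ W2_imb)
print(f"layer imbalance: {imbalance:.2f}")

# Standard elastic weight consolidation penalty (Kirkpatrick et al., 2017):
# (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2, where F is the
# diagonal Fisher information estimated on the previous task.
def ewc_penalty(theta, theta_star, fisher_diag, lam=1.0):
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_star) ** 2)
```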
Related papers
- Feature contamination: Neural networks learn uncorrelated features and fail to generalize [5.642322814965062]
Learning representations that generalize under distribution shifts is critical for building robust machine learning models. We show that even allowing a neural network to explicitly fit the representations obtained from a teacher network that can generalize out-of-distribution is insufficient for the generalization of the student network.
arXiv Detail & Related papers (2024-06-05T15:04:27Z)
- On the Role of Initialization on the Implicit Bias in Deep Linear Networks [8.272491066698041]
This study explores how weight initialization shapes the implicit bias of deep linear networks.
Various sources of implicit bias have been identified, such as step size, weight initialization, optimization algorithm, and number of parameters.
arXiv Detail & Related papers (2024-02-04T11:54:07Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- On the generalization of learning algorithms that do not converge [54.122745736433856]
Generalization analyses of deep learning typically assume that the training converges to a fixed point.
Recent results indicate that in practice, the weights of deep neural networks optimized with gradient descent often oscillate indefinitely.
arXiv Detail & Related papers (2022-08-16T21:22:34Z)
- How You Start Matters for Generalization [26.74340246715699]
We show that the generalization of neural networks is heavily tied to their initialization.
We make a case against the controversial flat-minima conjecture.
arXiv Detail & Related papers (2022-06-17T05:30:56Z)
- Formalizing Generalization and Robustness of Neural Networks to Weight Perturbations [58.731070632586594]
We provide the first formal analysis of the robustness of feed-forward neural networks with non-negative monotone activation functions to weight perturbations.
We also design a new theory-driven loss function for training generalizable and robust neural networks against weight perturbations.
arXiv Detail & Related papers (2021-03-03T06:17:03Z)
- Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool (a sketch of one such Hessian-norm probe appears after this list).
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- Understanding Generalization in Deep Learning via Tensor Methods [53.808840694241]
We advance the understanding of the relations between the network's architecture and its generalizability from the compression perspective.
We propose a series of intuitive, data-dependent and easily-measurable properties that tightly characterize the compressibility and generalizability of neural networks.
arXiv Detail & Related papers (2020-01-14T22:26:57Z)
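As a companion to the Hessian-norm diagnostic mentioned in the "Revisiting Initialization of Neural Networks" entry above, here is a minimal PyTorch sketch of one generic way to estimate the Hessian's spectral norm: power iteration on Hessian-vector products (Pearlmutter's trick). This is a standard technique offered as an assumption-laden illustration, not necessarily the estimator used in that paper.

```python
import torch

def hessian_spectral_norm(loss_fn, params, n_iter=20):
    """Estimate the spectral norm of the loss Hessian w.r.t. `params`
    by power iteration on Hessian-vector products."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])

    v = torch.randn_like(flat_grad)
    v /= v.norm()
    for _ in range(n_iter):
        # Pearlmutter's trick: d/dtheta (grad . v) = H v
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        v = hv / (hv.norm() + 1e-12)
    hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
    hv = torch.cat([h.reshape(-1) for h in hv])
    return abs((v @ hv).item())  # Rayleigh quotient at the top direction

# Usage on a toy model (illustrative only):
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = lambda: torch.nn.functional.mse_loss(model(x), y)
print(hessian_spectral_norm(loss_fn, list(model.parameters())))
```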
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.