A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning
- URL: http://arxiv.org/abs/2602.20062v1
- Date: Mon, 23 Feb 2026 17:19:33 GMT
- Title: A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning
- Authors: Nicolas Anguita, Francesco Locatello, Andrew M. Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, Clementine Domine
- Abstract summary: We develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks. We find that different initialization choices place the network into four distinct fine-tuning regimes. A smaller scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization.
- Score: 51.505728136705564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretraining and fine-tuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse, and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization on fine-tuning tasks that rely on a subset of pretraining features. We demonstrate empirically that the same initialization parameters impact generalization in nonlinear networks trained on CIFAR-100. Overall, our results demonstrate analytically how data and network initialization interact to shape fine-tuning generalization, highlighting an important role for the relative scale of initialization across different layers in enabling continued feature learning during fine-tuning.
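The diagonal linear network setting analyzed in the abstract can be illustrated with a minimal sketch (hyperparameters and the training loop below are illustrative assumptions, not the paper's exact setup): a two-layer diagonal network parameterizes its effective weights as an elementwise product beta = u * v, and the per-layer initialization scales alpha1 (earlier layer) and alpha2 (later layer) are the knobs whose relative size the theory says controls feature reuse versus refinement.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 100

# Ground-truth task that relies on a sparse subset of features.
beta_star = np.zeros(d)
beta_star[:3] = 1.0

X = rng.standard_normal((n, d))
y = X @ beta_star

def train(alpha1, alpha2, steps=5000, lr=0.01):
    """Two-layer diagonal linear net: prediction = X @ (u * v).

    alpha1 and alpha2 are the per-layer initialization scales; the
    abstract's claim is that a smaller scale in the earlier layer
    supports both feature reuse and refinement.
    """
    u = np.full(d, alpha1)   # "earlier" layer
    v = np.full(d, alpha2)   # "later" layer
    for _ in range(steps):
        r = X @ (u * v) - y          # residual
        g = X.T @ r / n              # gradient w.r.t. effective weights
        u, v = u - lr * g * v, v - lr * g * u   # chain rule through u * v
    return u * v

beta_hat = train(alpha1=1e-3, alpha2=1.0)
print(np.round(beta_hat[:5], 3))
```

Because n > d here the least-squares solution is unique, so this sketch only demonstrates the parameterization and dynamics; the regime-dependent generalization effects in the paper arise in the underdetermined setting.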
Related papers
- Neural network initialization with nonlinear characteristics and information on spectral bias [0.0]
Initialization of neural network parameters, such as weights and biases, has a crucial impact on learning performance. We propose a framework that adjusts the scale factors in the SWIM algorithm to capture low-frequency components in the early-stage hidden layers.
arXiv Detail & Related papers (2025-11-04T04:15:32Z) - Characterising the Inductive Biases of Neural Networks on Boolean Data [0.46180371154032906]
We provide an end-to-end, analytically tractable case study that links a network's inductive prior, its training dynamics including feature learning, and its eventual generalisation. Under a Monte Carlo learning algorithm, our model exhibits predictable training dynamics and the emergence of interpretable features.
arXiv Detail & Related papers (2025-05-29T23:03:33Z) - When Bias Helps Learning: Bridging Initial Prejudice and Trainability [3.9146761527401424]
Mean-field (MF) analyses have demonstrated that the parameter distribution in randomly initialized networks dictates whether gradients vanish or explode. Recent work has shown that untrained networks exhibit an initial-guessing bias (IGB), in which large regions of the input space are assigned to a single class. We provide a theoretical proof linking IGB to MF analyses, establishing that a network's predisposition toward specific classes is intrinsically tied to the conditions for efficient learning.
arXiv Detail & Related papers (2025-05-17T17:31:56Z) - Where You Place the Norm Matters: From Prejudiced to Neutral Initializations [5.070645558119592]
Normalization layers, such as Batch Normalization and Layer Normalization, are central components in modern neural networks. We study how the presence and placement of normalization within hidden layers influence the statistical properties of network predictions before training begins. Our work provides a principled understanding of how normalization can influence early training behavior and offers guidance for more controlled and interpretable network design.
arXiv Detail & Related papers (2025-05-16T14:38:30Z) - Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks trained with stochastic gradient descent (SGD) under the tensor program framework. We show that SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
arXiv Detail & Related papers (2025-03-12T17:33:13Z) - Rethinking Resource Management in Edge Learning: A Joint Pre-training and Fine-tuning Design Paradigm [87.47506806135746]
In some applications, edge learning is experiencing a shift in focusing from conventional learning from scratch to new two-stage learning.
This paper considers the problem of joint communication and computation resource management in a two-stage edge learning system.
It is shown that the proposed joint resource management over the pre-training and fine-tuning stages well balances the system performance trade-off.
arXiv Detail & Related papers (2024-04-01T00:21:11Z) - On the Generalization Ability of Unsupervised Pretraining [53.06175754026037]
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization.
This paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase.
Our results contribute to a better understanding of unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.
arXiv Detail & Related papers (2024-03-11T16:23:42Z) - Initial Guessing Bias: How Untrained Networks Favor Some Classes [0.09103230894909536]
We show that the structure of a deep neural network (DNN) can condition the model to assign all predictions to the same class, even before the beginning of training.
We prove that, besides dataset properties, the presence of this phenomenon is influenced by model choices, including dataset preprocessing methods.
We highlight theoretical consequences, such as the breakdown of node-permutation symmetry and the violation of self-averaging.
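The initial-guessing-bias phenomenon summarized above can be reproduced in a small, hedged sketch (the architecture and scale choices here are illustrative assumptions, not the paper's setup): in the ordered regime of a deep, untrained tanh MLP, hidden representations of different inputs become nearly identical with depth, so argmax predictions concentrate on a single class.

```python
import numpy as np

rng = np.random.default_rng(0)

def class_fractions(depth=30, width=128, n_inputs=1000, n_classes=10,
                    sigma_w=1.0, sigma_b=0.5):
    """Fraction of inputs an untrained tanh MLP assigns to each class.

    sigma_w and sigma_b are chosen to put the network in the ordered
    phase, where representations of distinct inputs collapse together
    with depth, producing an initial-guessing bias toward one class.
    """
    h = rng.standard_normal((n_inputs, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        b = rng.standard_normal(width) * sigma_b  # shared across inputs
        h = np.tanh(h @ W + b)
    W_out = rng.standard_normal((width, n_classes)) / np.sqrt(width)
    preds = (h @ W_out).argmax(axis=1)
    return np.bincount(preds, minlength=n_classes) / n_inputs

fracs = class_fractions()
print(fracs.max())  # a large share of inputs lands in a single class
```

Shrinking `depth` or moving `sigma_w` into the chaotic regime spreads the predictions back out, which is the kind of dependence on model choices the abstract describes.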
arXiv Detail & Related papers (2023-06-01T15:37:32Z) - Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
We then exploit higher-order statistics only later during training.
We discuss the relation of DSB to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via the Polyak-Łojasiewicz condition, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.