A Theoretical Analysis of Fine-tuning with Linear Teachers
- URL: http://arxiv.org/abs/2107.01641v1
- Date: Sun, 4 Jul 2021 14:15:50 GMT
- Title: A Theoretical Analysis of Fine-tuning with Linear Teachers
- Authors: Gal Shachaf, Alon Brutzkus, Amir Globerson
- Abstract summary: Fine-tuning is a common practice in deep learning, achieving excellent results on downstream tasks using relatively little training data.
We show that the success of fine-tuning depends on the similarity between the source tasks and the target task; measuring this similarity, however, is nontrivial.
- Score: 31.849269592822296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning is a common practice in deep learning, achieving excellent
generalization results on downstream tasks using relatively little training
data. Although widely used in practice, it lacks a strong theoretical
understanding. We analyze the sample complexity of this scheme for regression
with linear teachers in several architectures. Intuitively, the success of
fine-tuning depends on the similarity between the source tasks and the target
task; measuring this similarity, however, is nontrivial. We show that a relevant measure
considers the relation between the source task, the target task and the
covariance structure of the target data. In the setting of linear regression,
we show that under realistic settings a substantial sample complexity reduction
is plausible when the above measure is low. For deep linear regression, we
present a novel result regarding the inductive bias of gradient-based training
when the network is initialized with pretrained weights. Using this result we
show that the similarity measure for this setting is also affected by the depth
of the network. We further present results on shallow ReLU models, and analyze
the dependence of sample complexity there on source and target tasks. We
empirically demonstrate our results for both synthetic and realistic data.
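As a rough illustration of the linear regression setting above (not the paper's exact construction), the following NumPy sketch compares fine-tuning from a pretrained source teacher with training from scratch. It uses the fact that gradient descent on an underdetermined least-squares problem converges to the interpolating solution closest to its initialization; the teachers w_src and w_tgt, the covariance Sigma, and all dimensions are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 40                              # dimension >> number of target samples

# Anisotropic target-data covariance and linear teachers.
Sigma = np.diag(np.linspace(1.0, 0.01, d))
w_tgt = rng.normal(size=d)                  # target teacher
w_src = w_tgt + 0.1 * rng.normal(size=d)    # source teacher, similar to the target

# A small target training set with noiseless linear-teacher labels.
X = rng.normal(size=(n, d)) @ np.sqrt(Sigma)
y = X @ w_tgt

def gd_limit(w_init):
    """Interpolating solution closest to w_init (the limit of gradient descent
    started at w_init on the squared loss)."""
    return w_init + np.linalg.pinv(X) @ (y - X @ w_init)

def risk(w):
    """Population risk (w - w_tgt)' Sigma (w - w_tgt)."""
    delta = w - w_tgt
    return float(delta @ Sigma @ delta)

print("fine-tuned from w_src :", risk(gd_limit(w_src)))
print("trained from scratch  :", risk(gd_limit(np.zeros(d))))
```

Since the data only pin down n of the d directions, the remaining error is set by how far the initialization is from w_tgt in the unconstrained directions, weighted by Sigma, which loosely mirrors the source/target/covariance similarity measure described in the abstract.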
Related papers
- Features are fate: a theory of transfer learning in high-dimensional regression [23.840251319669907]
We show that when the target task is well represented by the feature space of the pre-trained model, transfer learning outperforms training from scratch.
For this model, we establish rigorously that when the feature space overlap between the source and target tasks is sufficiently strong, both linear transfer and fine-tuning improve performance.
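A minimal sketch of that claim, under illustrative assumptions (a frozen random ReLU feature map standing in for the pretrained representation, and a target that mixes a component realizable by those features with one that is not):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_train, n_test = 50, 40, 100, 5000

W_pre = rng.normal(size=(k, d)) / np.sqrt(d)      # "pretrained" first layer (illustrative)
feats = lambda X: np.maximum(X @ W_pre.T, 0.0)    # frozen ReLU feature map

beta = rng.normal(size=k)                         # head realizable by the features
v_out = rng.normal(size=d)                        # direction defining an unrelated component

def labels(X, overlap):
    """Targets mixing a function inside the pretrained feature space (weight
    `overlap`) with one outside it (weight 1 - overlap)."""
    return overlap * (feats(X) @ beta) + (1.0 - overlap) * np.maximum(X @ v_out, 0.0)

X_tr, X_te = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
for overlap in (1.0, 0.5, 0.0):
    y_tr, y_te = labels(X_tr, overlap), labels(X_te, overlap)
    # "Linear transfer": fit only a linear head on the frozen pretrained features.
    head, *_ = np.linalg.lstsq(feats(X_tr), y_tr, rcond=None)
    rel_err = np.mean((feats(X_te) @ head - y_te) ** 2) / np.mean(y_te ** 2)
    print(f"feature-space overlap {overlap:.1f}: relative test error {rel_err:.3f}")
```

In this toy setting the transferred head fits well when the target lies in the span of the pretrained features and degrades as the overlap shrinks, which is the qualitative behaviour the summary describes.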
arXiv Detail & Related papers (2024-10-10T17:58:26Z)
- Simplicity bias and optimization threshold in two-layer ReLU networks [24.43739371803548]
We show that despite overparametrization, networks converge toward simpler solutions rather than interpolating the training data.
Our analysis relies on the so-called early alignment phase, during which neurons align towards specific directions.
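A toy illustration of that early-alignment effect, assuming a two-layer ReLU network with tiny initialization trained by full-batch gradient descent on a single-ReLU teacher; the teacher direction v, the learning rate, and the initialization scale are all illustrative, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n = 2, 50, 512
v = np.array([1.0, 0.0])                  # teacher direction (illustrative)
X = rng.normal(size=(n, d))
y = np.maximum(X @ v, 0.0)                # teacher: y = ReLU(v . x)

scale = 1e-3                              # tiny init so the early phase is alignment-dominated
W = scale * rng.normal(size=(m, d))       # first-layer weights, one row per neuron
a = scale * rng.normal(size=m)            # output weights

lr = 0.1
for step in range(1501):
    pre = X @ W.T                         # (n, m) pre-activations
    act = np.maximum(pre, 0.0)
    resid = act @ a - y                   # (n,) residuals
    # Gradients of the mean squared error, written out by hand.
    grad_a = (2.0 / n) * act.T @ resid
    grad_W = (2.0 / n) * ((resid[:, None] * (pre > 0)) * a).T @ X
    a -= lr * grad_a
    W -= lr * grad_W
    if step % 300 == 0:
        cos = (W @ v) / np.linalg.norm(W, axis=1)   # cosine of each neuron with v
        print(step, "mean cos(w_j, v) over neurons with a_j > 0:",
              round(float(cos[a > 0].mean()), 3))
```

The printout tracks how strongly the neuron directions line up with the teacher direction while the weights are still small, which is the quantity the early-alignment analysis is about.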
arXiv Detail & Related papers (2024-10-03T09:58:57Z)
- Understanding Forgetting in Continual Learning with Linear Regression [21.8755265936716]
Continual learning, focused on sequentially learning multiple tasks, has gained significant attention recently.
We provide a general theoretical analysis of forgetting in the linear regression model via Gradient Descent.
We demonstrate that, given a sufficiently large data size, arranging the task sequence so that tasks with larger eigenvalues in their population data covariance matrices are trained later tends to result in increased forgetting.
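A sketch of the kind of sequential training loop used to measure forgetting, under simplifying assumptions (a single shared ground-truth vector, isotropic covariances that differ only in scale, and training each task to convergence from the previous solution); it is meant to show the setup, not to reproduce the paper's precise conditions:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 100, 30
w_star = rng.normal(size=d)               # shared ground truth (a simplification)

def sample_task(eigs):
    """One task: n Gaussian samples with covariance diag(eigs) and noiseless labels."""
    X = rng.normal(size=(n, d)) * np.sqrt(eigs)
    return X, X @ w_star

def train_sequence(tasks):
    """Run each task to convergence with gradient descent started from the previous
    solution, i.e. move to the interpolator closest to the current weights."""
    w = np.zeros(d)
    for X, y in tasks:
        w = w + np.linalg.pinv(X) @ (y - X @ w)
    return w

def task_loss(w, task):
    X, y = task
    return float(np.mean((X @ w - y) ** 2))

task_small = sample_task(np.full(d, 0.1))  # smaller covariance eigenvalues
task_large = sample_task(np.full(d, 2.0))  # larger covariance eigenvalues

for name, order in [("small -> large", [task_small, task_large]),
                    ("large -> small", [task_large, task_small])]:
    w_end = train_sequence(order)
    # The first task is fit exactly when it is trained; its loss at the end of the
    # sequence is therefore the amount forgotten.
    print(name, "| forgetting of first task:", round(task_loss(w_end, order[0]), 4))
```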
arXiv Detail & Related papers (2024-05-27T18:33:37Z)
- Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Training trajectories, mini-batch losses and the curious role of the learning rate [13.848916053916618]
Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning.
We propose a simple model and a geometric interpretation that allows to analyze the relationship between the gradients of mini-batches and the full batch.
In particular, a very low loss value can be reached in just one step of descent with a large enough learning rate.
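A toy quadratic illustration of that one-step observation: with a linear model and Gaussian data standing in for the deep networks the paper studies, a single descent step along the mini-batch gradient, taken with the exact line-search step size, already collapses the mini-batch loss; the dimensions, batch size, and data model are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, b = 200, 1000, 4                     # parameters, dataset size, mini-batch size
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true

w0 = rng.normal(size=d)                    # arbitrary starting point
Xb, yb = X[:b], y[:b]                      # one mini-batch

def batch_loss(w):
    return float(np.mean((Xb @ w - yb) ** 2))

# Mini-batch gradient and the exact line-search step size along it
# (for a quadratic loss the best step is (g.g) / (g.H.g) with H = (2/b) Xb' Xb).
g = (2.0 / b) * Xb.T @ (Xb @ w0 - yb)
H = (2.0 / b) * Xb.T @ Xb
lr = float(g @ g / (g @ H @ g))

w1 = w0 - lr * g
print("mini-batch loss before:", round(batch_loss(w0), 3))
print("step size used        :", round(lr, 4))
print("mini-batch loss after :", round(batch_loss(w1), 5))
```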
arXiv Detail & Related papers (2023-01-05T21:58:46Z)
- An Information-Theoretic Framework for Supervised Learning [22.280001450122175]
We propose a novel information-theoretic framework with its own notions of regret and sample complexity.
We study the sample complexity of learning from data generated by deep neural networks with ReLU activation units.
We conclude by corroborating our theoretical results with experimental analysis of random single-hidden-layer neural networks.
arXiv Detail & Related papers (2022-03-01T05:58:28Z)
- Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-perfect solutions to non-convex training problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z)
- Estimating informativeness of samples with Smooth Unique Information [108.25192785062367]
We measure how much a sample informs the final weights and how much it informs the function computed by the weights.
We give efficient approximations of these quantities using a linearized network.
We apply these measures to several problems, such as dataset summarization.
arXiv Detail & Related papers (2021-01-17T10:29:29Z)
- Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework well preserves the relations between samples.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
- Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.