Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective
- URL: http://arxiv.org/abs/2405.16747v1
- Date: Mon, 27 May 2024 01:31:40 GMT
- Title: Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective
- Authors: Akiyoshi Tomihari, Issei Sato
- Abstract summary: We analyze the training dynamics of LP-FT for classification models on the basis of the neural tangent kernel (NTK) theory.
We observe a significant increase in the linear head norm during LP, stemming from training with the cross-entropy (CE) loss.
Our experiments with a Transformer-based model on natural language processing tasks confirm our theoretical analysis and demonstrate the effectiveness of LP-FT.
- Score: 32.01426831450348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The two-stage fine-tuning (FT) method, linear probing then fine-tuning (LP-FT), consistently outperforms linear probing (LP) and FT alone in terms of accuracy for both in-distribution (ID) and out-of-distribution (OOD) data. This success is largely attributed to the preservation of pre-trained features, achieved through a near-optimal linear head obtained during LP. However, despite the widespread use of large language models, the exploration of complex architectures such as Transformers remains limited. In this paper, we analyze the training dynamics of LP-FT for classification models on the basis of the neural tangent kernel (NTK) theory. Our analysis decomposes the NTK matrix into two components, highlighting the importance of the linear head norm alongside the prediction accuracy at the start of the FT stage. We also observe a significant increase in the linear head norm during LP, stemming from training with the cross-entropy (CE) loss, which effectively minimizes feature changes. Furthermore, we find that this increased norm can adversely affect model calibration, a challenge that can be addressed by temperature scaling. Additionally, we extend our analysis with the NTK to the low-rank adaptation (LoRA) method and validate its effectiveness. Our experiments with a Transformer-based model on natural language processing tasks across multiple benchmarks confirm our theoretical analysis and demonstrate the effectiveness of LP-FT in fine-tuning language models. Code is available at https://github.com/tom4649/lp-ft_ntk.
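For intuition, the decomposition mentioned above can be illustrated for a $C$-class classifier of the form $f(x) = W h_\theta(x) \in \mathbb{R}^C$, with backbone features $h_\theta$ and linear head $W$; this notation is ours and need not match the paper's exact formulation. The empirical NTK then splits into a head component and a feature component:

$$
\Theta(x, x') \;=\; \underbrace{\langle h_\theta(x), h_\theta(x') \rangle \, I_C}_{\text{linear-head component}} \;+\; \underbrace{W \, J_\theta(x) \, J_\theta(x')^{\top} W^{\top}}_{\text{feature component}}, \qquad J_\theta(x) = \frac{\partial h_\theta(x)}{\partial \theta},
$$

so the head obtained during LP, and in particular its norm, enters the kernel that governs the FT dynamics, consistent with the abstract's emphasis on the head norm and prediction accuracy at the start of the FT stage.

To make the two-stage procedure itself concrete, the following is a minimal PyTorch-style sketch of LP-FT together with post-hoc temperature scaling. It is an illustration under our own assumptions: the module names (`backbone`, `head`), hyperparameters, and data loader are placeholders, and it is not the authors' released implementation (see the repository linked above for that).

```python
# Minimal LP-FT sketch (illustrative only; not the authors' released code).
# Stage 1 (LP): freeze the backbone and train only the linear head with CE loss.
# Stage 2 (FT): unfreeze everything and fine-tune from the LP head.
# `backbone`, `feat_dim`, and `loader` are assumed placeholders.
import torch
import torch.nn as nn


class LinearHeadClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                      # pre-trained feature extractor
        self.head = nn.Linear(feat_dim, num_classes)  # linear probe / classification head

    def forward(self, x):
        return self.head(self.backbone(x))


def lp_ft(model, loader, lp_epochs=5, ft_epochs=3, lp_lr=1e-3, ft_lr=1e-5):
    ce = nn.CrossEntropyLoss()

    # Stage 1: linear probing -- backbone frozen, only the head is updated.
    for p in model.backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(model.head.parameters(), lr=lp_lr)
    for _ in range(lp_epochs):
        for x, y in loader:
            opt.zero_grad()
            ce(model(x), y).backward()
            opt.step()

    # Stage 2: full fine-tuning, starting from the near-optimal LP head.
    for p in model.backbone.parameters():
        p.requires_grad_(True)
    opt = torch.optim.AdamW(model.parameters(), lr=ft_lr)
    for _ in range(ft_epochs):
        for x, y in loader:
            opt.zero_grad()
            ce(model(x), y).backward()
            opt.step()
    return model


def calibrate_with_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    # Post-hoc temperature scaling: dividing logits by T > 1 softens predictions
    # that became over-confident as the head norm grew during LP.
    # T would be fitted on held-out validation data.
    return logits / temperature
```

In this sketch the only difference from plain FT is that stage 1 fixes the backbone, so FT starts from a near-optimal head rather than a randomly initialized one; temperature scaling is then applied to validation/test logits for calibration.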
Related papers
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient training and inference algorithms, such as low-rank computation, achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Tangent Transformers for Composition, Privacy and Removal [58.280295030852194]
Tangent Attention Fine-Tuning (TAFT) is a method for fine-tuning linearized transformers; a generic linearized fine-tuning sketch appears after this list.
arXiv Detail & Related papers (2023-07-16T18:31:25Z) - Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z) - Deep Neural Network Based Accelerated Failure Time Models using Rank Loss [0.0]
An accelerated failure time (AFT) model assumes a log-linear relationship between failure times and a set of covariates.
Deep neural networks (DNNs) have received considerable attention over the past decades and have achieved remarkable success in a variety of fields.
We propose to apply DNNs in fitting AFT models using a Gehan-type loss, combined with a sub-sampling technique.
arXiv Detail & Related papers (2022-06-13T08:38:18Z) - Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution [100.01469697743322]
Fine-tuning can achieve worse accuracy than linear probing when the pretrained features are good and the distribution shift is large.
We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting.
Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning combines the benefits of both fine-tuning and linear probing.
arXiv Detail & Related papers (2022-02-21T09:03:34Z) - Merging Two Cultures: Deep and Statistical Learning [3.15863303008255]
Merging the two cultures of deep and statistical learning provides insights into structured high-dimensional data.
We show that prediction, optimisation and uncertainty quantification can be achieved using probabilistic methods at the output layer of the model.
arXiv Detail & Related papers (2021-10-22T02:57:21Z) - Rank-R FNN: A Tensor-Based Learning Model for High-Order Data Classification [69.26747803963907]
Rank-R Feedforward Neural Network (FNN) is a tensor-based nonlinear learning model that imposes a Canonical Polyadic (CP) decomposition on its parameters.
It handles inputs as multilinear arrays, bypassing the need for vectorization, and can thus fully exploit the structural information along every data dimension.
We establish the universal approximation and learnability properties of Rank-R FNN, and we validate its performance on real-world hyperspectral datasets.
arXiv Detail & Related papers (2021-04-11T16:37:32Z) - LQF: Linear Quadratic Fine-Tuning [114.3840147070712]
We present the first method for linearizing a pre-trained model that achieves comparable performance to non-linear fine-tuning.
LQF consists of simple modifications to the architecture, loss function and optimization typically used for classification.
arXiv Detail & Related papers (2020-12-21T06:40:20Z)
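Two of the entries above, TAFT and LQF, fine-tune a linearized model, i.e. the first-order Taylor expansion of a pre-trained network around its initial weights, which is also the regime in which NTK analysis applies. The sketch below illustrates that generic idea; it is our own illustration under stated assumptions (a recent PyTorch with `torch.func`, and placeholder `model` and `loader` objects), not the specific constructions proposed in either paper.

```python
# Generic linearized fine-tuning sketch (illustration only; not TAFT or LQF).
# The trainable object is the perturbation `delta`; predictions come from the
# first-order Taylor expansion f_lin(x) = f(x; w0) + J_w f(x; w0) . delta.
import torch
import torch.nn as nn
from torch.func import functional_call


def linearized_logits(model: nn.Module, params0: dict, delta: dict, x: torch.Tensor):
    names = list(params0.keys())

    def f(*flat_params):
        # Run the model with the supplied parameter tensors (functional form).
        return functional_call(model, dict(zip(names, flat_params)), (x,))

    # Jacobian-vector product gives the first-order term; create_graph=True keeps
    # the result differentiable with respect to delta so it can be trained.
    out, jvp_out = torch.autograd.functional.jvp(
        f, tuple(params0.values()), tuple(delta.values()), create_graph=True
    )
    return out + jvp_out


def train_linearized(model: nn.Module, loader, epochs: int = 3, lr: float = 1e-2):
    # Freeze the pre-trained weights w0 and optimize only the perturbation delta.
    params0 = {k: v.detach().clone() for k, v in model.named_parameters()}
    delta = {k: torch.zeros_like(v, requires_grad=True) for k, v in params0.items()}
    opt = torch.optim.SGD(delta.values(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = ce(linearized_logits(model, params0, delta, x), y)
            loss.backward()
            opt.step()
    return params0, delta
```

Only the perturbation `delta` is trained, so the resulting predictor is exactly linear in its trainable parameters, which is what makes linearized models amenable to the composition and removal use cases named in the TAFT entry's title.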
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.