Less is More: Selective Layer Finetuning with SubTuning
- URL: http://arxiv.org/abs/2302.06354v3
- Date: Sun, 2 Jul 2023 12:28:46 GMT
- Title: Less is More: Selective Layer Finetuning with SubTuning
- Authors: Gal Kaplun, Andrey Gurevich, Tal Swisa, Mazor David, Shai
Shalev-Shwartz and Eran Malach
- Abstract summary: Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance.
In this work, we study an alternative finetuning method, where instead of finetuning all the weights of the network, we only train a carefully chosen subset of layers, keeping the rest of the weights frozen at their initial (pretrained) values.
We demonstrate that subset finetuning (or SubTuning) often achieves accuracy comparable to full finetuning of the model, and even surpasses the performance of full finetuning when training data is scarce.
- Score: 26.43027780266698
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Finetuning a pretrained model has become a standard approach for training
neural networks on novel tasks, resulting in fast convergence and improved
performance. In this work, we study an alternative finetuning method, where
instead of finetuning all the weights of the network, we only train a carefully
chosen subset of layers, keeping the rest of the weights frozen at their
initial (pretrained) values. We demonstrate that \emph{subset finetuning} (or
SubTuning) often achieves accuracy comparable to full finetuning of the model,
and even surpasses the performance of full finetuning when training data is
scarce. Therefore, SubTuning allows deploying new tasks at minimal
computational cost, while enjoying the benefits of finetuning the entire model.
This yields a simple and effective method for multi-task learning, where
different tasks do not interfere with one another, and yet share most of the
resources at inference time. We demonstrate the efficiency of SubTuning across
multiple tasks, using different network architectures and pretraining methods.
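The authors' implementation is not reproduced here, but the core idea is straightforward to illustrate. Below is a minimal PyTorch sketch assuming a torchvision ResNet-50 backbone: all pretrained weights are frozen except one chosen block and a new task head. The choice of `layer3` is an arbitrary placeholder, not the paper's selection procedure.
```python
import torch
from torch import nn
from torchvision import models

# Pretrained backbone; freeze every weight at its pretrained value.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():
    p.requires_grad = False

# SubTuning idea: unfreeze only a carefully chosen subset of layers.
# Here `layer3` is a placeholder choice, not the paper's selection rule.
for p in model.layer3.parameters():
    p.requires_grad = True

# A fresh head for the new task (trainable by construction).
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# One training step on a dummy batch from the new task.
x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```
Because the frozen layers stay at their pretrained values, several tasks can share the same backbone at inference time and differ only in their tuned subset and task head, which is the multi-task benefit described in the abstract.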
Related papers
- On the Effectiveness of LayerNorm Tuning for Continual Learning in
Vision Transformers [47.77328392236625]
State-of-the-art rehearsal-free continual learning methods exploit the peculiarities of Vision Transformers to learn task-specific prompts.
We introduce a two-stage training procedure: we first optimize the task-specific parameters and then train the classifier with the same selection procedure that is used at inference time.
Our method achieves results that are either superior or on par with the state of the art while being computationally cheaper.
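For illustration only, a rough PyTorch sketch of the general recipe (training just the LayerNorm parameters of a frozen ViT plus a per-task classifier) is shown below; the paper's two-stage procedure and classifier selection are not reproduced, and the torchvision ViT-B/16 backbone is an assumption.
```python
import torch
from torch import nn
from torchvision import models

# Pretrained ViT; freeze everything by default.
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
for p in vit.parameters():
    p.requires_grad = False

# Unfreeze only the LayerNorm affine parameters as task-specific weights.
for m in vit.modules():
    if isinstance(m, nn.LayerNorm):
        for p in m.parameters():
            p.requires_grad = True

# A fresh per-task classification head.
vit.heads = nn.Linear(vit.heads.head.in_features, 10)

optimizer = torch.optim.AdamW(
    (p for p in vit.parameters() if p.requires_grad), lr=1e-3
)
```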
arXiv Detail & Related papers (2023-08-18T15:11:16Z)
- Surgical Fine-Tuning Improves Adaptation to Distribution Shifts [114.17184775397067]
A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pre-trained model.
This paper shows that in such settings, selectively fine-tuning a subset of layers matches or outperforms commonly used fine-tuning approaches.
arXiv Detail & Related papers (2022-10-20T17:59:15Z)
- Task-Customized Self-Supervised Pre-training with Scalable Dynamic Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, including irrelevant data in pre-training may degrade the downstream performance.
It is burdensome, and often infeasible, to curate a separate task-customized pre-training dataset for every downstream task.
arXiv Detail & Related papers (2022-05-26T10:49:43Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
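The specific parameter-efficient method proposed in that paper is not reproduced here; purely as an illustration of the paradigm, a generic bottleneck-adapter sketch in PyTorch might look as follows (the layer sizes and placement are assumptions for the example).
```python
import torch
from torch import nn

class BottleneckAdapter(nn.Module):
    """Small residual adapter trained on top of a frozen layer.

    Only the adapter's down/up projections receive gradients, so each
    new task adds a tiny number of parameters to the frozen backbone.
    """

    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

# Frozen pretrained layer (placeholder) wrapped with a trainable adapter.
frozen_layer = nn.Linear(768, 768).requires_grad_(False)
adapter = BottleneckAdapter(dim=768)

out = adapter(frozen_layer(torch.randn(4, 768)))  # only `adapter` is trainable
```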
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
- Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Task Adaptive Parameter Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers.
Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters.
We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
arXiv Detail & Related papers (2022-03-30T23:16:07Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model with better performance than training the low-rank model alone.
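As a toy illustration of a splittable low-rank core (the paper's simultaneous training scheme itself is not reproduced), a layer whose low-rank part can be evaluated on its own might be sketched as follows; the sizes and rank are arbitrary assumptions.
```python
import torch
from torch import nn

class LowRankCoreLinear(nn.Module):
    """Linear layer built as a low-rank 'core' plus a full-rank remainder.

    The full weight (U @ V + rest) is used by the full network, while the
    low-rank core (U @ V) can be split off as a cheaper standalone model.
    """

    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.U = nn.Parameter(0.02 * torch.randn(d_out, rank))
        self.V = nn.Parameter(0.02 * torch.randn(rank, d_in))
        self.rest = nn.Parameter(torch.zeros(d_out, d_in))

    def forward(self, x: torch.Tensor, core_only: bool = False) -> torch.Tensor:
        weight = self.U @ self.V
        if not core_only:
            weight = weight + self.rest
        return x @ weight.T

layer = LowRankCoreLinear(256, 256)
x = torch.randn(4, 256)
full_out = layer(x)                   # full network's view of the layer
core_out = layer(x, core_only=True)   # the split-off low-rank core
```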
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
- Investigating Transferability in Pretrained Language Models [8.83046338075119]
We consider a simple ablation technique for determining the impact of each pretrained layer on transfer task performance.
This technique reveals that in BERT, layers with high probing performance on downstream GLUE tasks are neither necessary nor sufficient for high accuracy on those tasks.
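A hedged sketch of that kind of ablation, assuming the Hugging Face transformers BERT implementation and an arbitrary layer choice: re-initialize one pretrained encoder layer, then finetune and evaluate on the downstream task to measure its contribution. This is an illustration, not the paper's exact protocol.
```python
from transformers import AutoModel

# Load the pretrained encoder and pick one layer to ablate (placeholder index).
model = AutoModel.from_pretrained("bert-base-uncased")
layer_to_ablate = 6

# Reset that layer to random initialization, leaving the rest pretrained.
# (_init_weights is the model's internal weight-initialization helper.)
model.encoder.layer[layer_to_ablate].apply(model._init_weights)

# The ablated model is then finetuned and evaluated on the downstream task,
# and the accuracy change relative to the intact model is recorded.
```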
arXiv Detail & Related papers (2020-04-30T17:23:19Z)
- Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks [95.51368472949308]
Adaptation can be useful when training data is scarce, or when one wishes to encode priors in the network.
In this paper, we propose a straightforward alternative: side-tuning, which adapts a frozen pretrained network by summing its representation with that of a small, trainable side network.
arXiv Detail & Related papers (2019-12-31T18:52:32Z)
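The original side-tuning code is not reproduced here; under the assumption of a frozen ResNet-50 base and a small ResNet-18 side network, a minimal PyTorch sketch of the additive-side-network idea could look like this (the blending scheme and architectures are placeholders).
```python
import torch
from torch import nn
from torchvision import models

class SideTuned(nn.Module):
    """Frozen base network plus a small trainable side network whose
    features are blended additively before a task head."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.base = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        self.base.fc = nn.Identity()                 # expose 2048-d features
        for p in self.base.parameters():
            p.requires_grad = False

        side = models.resnet18(weights=None)         # small trainable side net
        side.fc = nn.Linear(side.fc.in_features, 2048)
        self.side = side

        self.alpha = nn.Parameter(torch.zeros(()))   # learnable blending weight
        self.head = nn.Linear(2048, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.alpha)
        feats = a * self.base(x) + (1 - a) * self.side(x)
        return self.head(feats)

model = SideTuned()
logits = model(torch.randn(2, 3, 224, 224))          # shape [2, 10]
```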