Improving Reliability of Fine-tuning with Block-wise Optimisation
- URL: http://arxiv.org/abs/2301.06133v1
- Date: Sun, 15 Jan 2023 16:20:18 GMT
- Title: Improving Reliability of Fine-tuning with Block-wise Optimisation
- Authors: Basel Barakat and Qiang Huang
- Abstract summary: Finetuning can be used to tackle domain-specific tasks by transferring knowledge.
We propose a novel block-wise optimization mechanism, which adapts the weights of a group of layers of a pre-trained model.
The proposed approaches are tested on an often-used dataset, Tf_flower.
- Score: 6.83082949264991
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Finetuning can be used to tackle domain-specific tasks by transferring
knowledge. Previous studies on finetuning focused on adapting only the weights
of a task-specific classifier or re-optimizing all layers of the pre-trained
model using the new task data. The first type of method cannot mitigate the
mismatch between a pre-trained model and the new task data, while the second
can easily cause over-fitting when processing tasks with limited data.
To explore the effectiveness of fine-tuning, we propose a novel block-wise
optimization mechanism, which adapts the weights of a group of layers of a
pre-trained model. In our work, the layer selection can be done in four
different ways. The first is layer-wise adaptation, which aims to search for
the most salient single layer according to the classification performance. The
second builds on the first, jointly adapting a small number of top-ranked
layers instead of an individual layer. The third is block-based segmentation,
where the layers of a deep network are segmented into blocks by non-weighting
layers, such as the MaxPooling layer and Activation layer. The
last one is to use a fixed-length sliding window to group layers block by
block. To identify which group of layers is the most suitable for finetuning,
the search starts from the target (output) end and is conducted by freezing all
layers except the selected layers and the classification layers. The most salient
group of layers is determined in terms of classification performance. In our
experiments, the proposed approaches are tested on an often-used dataset,
Tf_flower, by finetuning five typical pre-trained models: VGG16, MobileNet-v1,
MobileNet-v2, MobileNet-v3, and ResNet50v2. The obtained results show that our
proposed block-wise approaches achieve better performance than the two baseline
methods and the layer-wise method.
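
Below is a minimal sketch of the sliding-window variant of the block-wise search described in the abstract, assuming a TensorFlow/Keras workflow with the tf_flowers dataset. The backbone (MobileNetV2), block length, optimiser settings, and epoch count are illustrative placeholders rather than values taken from the paper.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

IMG_SIZE, BLOCK_LEN, NUM_CLASSES = 224, 4, 5

def load_data():
    # tf_flowers only ships a "train" split, so carve out a validation set.
    def prep(img, label):
        img = tf.image.resize(img, (IMG_SIZE, IMG_SIZE))
        return tf.keras.applications.mobilenet_v2.preprocess_input(img), label
    train, val = tfds.load("tf_flowers", split=["train[:80%]", "train[80%:]"],
                           as_supervised=True)
    return (train.map(prep).batch(32).prefetch(1),
            val.map(prep).batch(32).prefetch(1))

def block_accuracy(start, ds_train, ds_val):
    # Rebuild the backbone so every candidate block starts from the same
    # pre-trained ImageNet weights.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(IMG_SIZE, IMG_SIZE, 3), include_top=False, weights="imagenet")
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    out = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = tf.keras.Model(base.input, out)

    # Freeze everything, then unfreeze only the selected block; the new
    # classification head stays trainable. (The block-segmentation variant
    # would instead partition base.layers at non-weighting layers such as
    # MaxPooling/Activation layers.)
    for layer in base.layers:
        layer.trainable = False
    for layer in base.layers[start:start + BLOCK_LEN]:
        layer.trainable = True

    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(ds_train, validation_data=ds_val, epochs=5, verbose=0)
    return model.evaluate(ds_val, verbose=0)[1]

ds_train, ds_val = load_data()

# Number of backbone layers (built once, just to count them).
n_layers = len(tf.keras.applications.MobileNetV2(
    input_shape=(IMG_SIZE, IMG_SIZE, 3), include_top=False, weights=None).layers)

# Slide the window from the target (output) end towards the input and keep
# the block that gives the best validation accuracy.
scores = {s: block_accuracy(s, ds_train, ds_val)
          for s in range(n_layers - BLOCK_LEN, -1, -BLOCK_LEN)}
best = max(scores, key=scores.get)
print(f"best block: layers {best}..{best + BLOCK_LEN - 1}, "
      f"val acc = {scores[best]:.3f}")
```

Each candidate block trains its own copy of the model, which mirrors the performance-driven search in the abstract; in practice the per-block training budget would be kept small to make the sweep affordable.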
Related papers
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding [13.747101397628887]
We present an end-to-end solution to speed up inference of large language models (LLMs).
We apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit.
We show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model.
arXiv Detail & Related papers (2024-04-25T16:20:23Z)
- Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination.
Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z)
- Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
It is most prominently used in federated learning.
We analyse the performance of the models that result from averaging single layers, or groups of layers (a minimal layer-averaging sketch in this spirit appears after this list).
arXiv Detail & Related papers (2023-07-13T09:39:10Z)
- Learning the Right Layers: a Data-Driven Layer-Aggregation Strategy for Semi-Supervised Learning on Multilayer Graphs [2.752817022620644]
Clustering (or community detection) on multilayer graphs poses several additional complications.
One of the major challenges is to establish the extent to which each layer contributes to the cluster assignment.
We propose a parameter-free Laplacian-regularized model that learns an optimal nonlinear combination of the different layers from the available input labels.
arXiv Detail & Related papers (2023-05-31T19:50:11Z)
- Enhancing Classification with Hierarchical Scalable Query on Fusion Transformer [0.4129225533930965]
This paper proposes a method to boost fine-grained classification through a hierarchical approach via learnable independent query embeddings.
We exploit the idea of hierarchy to learn query embeddings that are scalable across all levels.
Our method outperforms the existing methods with an improvement of 11% on fine-grained classification.
arXiv Detail & Related papers (2023-02-28T11:00:55Z)
- WLD-Reg: A Data-dependent Within-layer Diversity Regularizer [98.78384185493624]
Neural networks are composed of multiple layers arranged in a hierarchical structure and jointly trained with gradient-based optimization.
We propose to complement this traditional 'between-layer' feedback with additional 'within-layer' feedback to encourage the diversity of the activations within the same layer.
We present an extensive empirical study confirming that the proposed approach enhances the performance of several state-of-the-art neural network models in multiple tasks.
arXiv Detail & Related papers (2023-01-03T20:57:22Z)
- Surgical Fine-Tuning Improves Adaptation to Distribution Shifts [114.17184775397067]
A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pre-trained model.
This paper shows that in such settings, selectively fine-tuning a subset of layers matches or outperforms commonly used fine-tuning approaches.
arXiv Detail & Related papers (2022-10-20T17:59:15Z)
- Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning [31.171051511744636]
Transfer-learning methods aim to improve performance in a data-scarce target domain using a model pretrained on a data-rich source domain.
We propose a method, Head-to-Toe probing (Head2Toe), that selects features from all layers of the source model to train a classification head for the target domain (a simplified all-layer probing sketch appears after this list).
arXiv Detail & Related papers (2022-01-10T18:40:07Z)
- LV-BERT: Exploiting Layer Variety for BERT [85.27287501885807]
We introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models.
We then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture.
The LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks.
arXiv Detail & Related papers (2021-06-22T13:20:14Z)
- Partial Is Better Than All: Revisiting Fine-tuning Strategy for Few-shot Learning [76.98364915566292]
A common practice is to train a model on the base set first and then transfer to novel classes through fine-tuning.
We propose to transfer partial knowledge by freezing or fine-tuning particular layer(s) in the base model.
We conduct extensive experiments on CUB and mini-ImageNet to demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2021-02-08T03:27:05Z)
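
As referenced in the Layer-wise Linear Mode Connectivity entry above, here is a minimal sketch of layer-wise parameter averaging between two Keras models with identical architectures. The 50/50 interpolation weight and the option to restrict averaging to selected layers are illustrative simplifications, not the paper's exact procedure.

```python
import tensorflow as tf

def average_layers(model_a, model_b, layer_indices=None, alpha=0.5):
    """Return a copy of model_a whose selected layers are interpolated
    towards model_b; all other layers keep model_a's weights."""
    merged = tf.keras.models.clone_model(model_a)
    merged.set_weights(model_a.get_weights())
    indices = range(len(model_a.layers)) if layer_indices is None else layer_indices
    for i in indices:
        w_a = model_a.layers[i].get_weights()
        w_b = model_b.layers[i].get_weights()
        # Element-wise interpolation of every weight tensor in the layer.
        merged.layers[i].set_weights(
            [alpha * a + (1 - alpha) * b for a, b in zip(w_a, w_b)])
    return merged

# Averaging only a group of layers, e.g. the last ten, leaves the rest of
# model_a untouched:
# merged = average_layers(model_a, model_b, layer_indices=range(-10, 0))
```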
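
As referenced in the Head2Toe entry above, a rough sketch of probing all layers of a frozen backbone with a single linear head, again in Keras. The feature-selection step of the actual method is omitted, and the backbone and pooling choices are assumptions.

```python
import tensorflow as tf

def build_all_layer_probe(num_classes, img_size=224):
    # Frozen backbone; only the linear head on top of the concatenated
    # features is trained.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(img_size, img_size, 3), include_top=False, weights="imagenet")
    base.trainable = False

    # Pool every 4-D feature map in the backbone to a fixed-length vector
    # and concatenate them into one long feature vector.
    pooled = [tf.keras.layers.GlobalAveragePooling2D()(layer.output)
              for layer in base.layers
              if not isinstance(layer, tf.keras.layers.InputLayer)
              and len(layer.output.shape) == 4]
    features = tf.keras.layers.Concatenate()(pooled)
    logits = tf.keras.layers.Dense(num_classes)(features)
    return tf.keras.Model(base.input, logits)

model = build_all_layer_probe(num_classes=5)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
# model.fit(...) on the target-domain data trains only the linear head.
```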