Guillotine Regularization: Why removing layers is needed to improve
generalization in Self-Supervised Learning
- URL: http://arxiv.org/abs/2206.13378v2
- Date: Fri, 9 Jun 2023 14:22:16 GMT
- Title: Guillotine Regularization: Why removing layers is needed to improve
generalization in Self-Supervised Learning
- Authors: Florian Bordes, Randall Balestriero, Quentin Garrido, Adrien Bardes,
Pascal Vincent
- Abstract summary: Guillotine Regularization (GR) is a generically applicable method that has been used to improve generalization performance in transfer learning scenarios.
We identify the underlying reasons behind its success and show that the optimal layer to use might change significantly depending on the training setup, the data or the downstream task.
- Score: 15.009986848506486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One unexpected technique that emerged in recent years consists in training a
Deep Network (DN) with a Self-Supervised Learning (SSL) method, and using this
network on downstream tasks but with its last few projector layers entirely
removed. This trick of throwing away the projector is actually critical for SSL
methods to display competitive performance on ImageNet, where more than 30
percentage points can be gained that way. This is a little vexing, as one would
hope that the network layer at which invariance is explicitly enforced by the
SSL criterion during training (the last projector layer) should be the one to
use for best generalization performance downstream. But it seems not to be, and
this study sheds some light on why. This trick, which we name Guillotine
Regularization (GR), is in fact a generically applicable method that has been
used to improve generalization performance in transfer learning scenarios. In
this work, we identify the underlying reasons behind its success and show that
the optimal layer to use might change significantly depending on the training
setup, the data or the downstream task. Lastly, we give some insights on how to
reduce the need for a projector in SSL by aligning the pretext SSL task and the
downstream task.
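For concreteness, the following is a minimal PyTorch sketch of the setup described above: a backbone is pretrained with an SSL criterion applied on top of a projector, and the projector is then removed (guillotined) before downstream evaluation. The ResNet-50 backbone, the projector sizes, and the linear-probe evaluation are illustrative assumptions, not the authors' exact configuration.
```python
# Minimal sketch of Guillotine Regularization (GR): train with a projector,
# then throw it away and use the backbone features downstream.
# Backbone choice, projector sizes, and the linear probe are assumptions.
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights=None)
feat_dim = backbone.fc.in_features            # 2048 for ResNet-50
backbone.fc = nn.Identity()                   # expose pre-projector features

projector = nn.Sequential(                    # the layers that get "guillotined"
    nn.Linear(feat_dim, 8192), nn.BatchNorm1d(8192), nn.ReLU(),
    nn.Linear(8192, 8192), nn.BatchNorm1d(8192), nn.ReLU(),
    nn.Linear(8192, 256),
)

def ssl_embedding(x):
    # During pretraining, the SSL criterion is enforced on the projector output.
    return projector(backbone(x))

# ... SSL pretraining would optimize backbone + projector here ...

# Downstream: discard the projector and probe the backbone representation.
linear_probe = nn.Linear(feat_dim, 1000)      # e.g. ImageNet classes
with torch.no_grad():
    feats = backbone(torch.randn(8, 3, 224, 224))
logits = linear_probe(feats)
```
Per the abstract, which layer is best to cut at is not fixed: it can change with the training setup, the data, and the downstream task, and aligning the pretext and downstream tasks reduces how much guillotining is needed.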
Related papers
- Llama SLayer 8B: Shallow Layers Hold the Key to Knowledge Injection [73.06596715100859]
We study the importance of each layer in finding the optimal layer range for knowledge injection.
We propose the S strategy, a post-pretraining strategy of selectively enhancing shallow layers while pruning the less effective deep ones.
Based on this strategy, we introduce Llama Slayer-8B and Llama Slayer-8B-Instruct.
arXiv Detail & Related papers (2024-10-03T09:28:59Z)
- Investigating the Benefits of Projection Head for Representation Learning [11.20245728716827]
An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations.
The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better?
We show that implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers.
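As a loose, hypothetical illustration of the kind of layer-wise measurement this finding points at, the sketch below checks how concentrated the per-feature activation variance is at each depth of a toy MLP; the metric and the architecture are assumptions, not the paper's protocol.
```python
# Hypothetical layer-wise diagnostic: how unequal are features at each depth?
# Measured here as the share of activation variance held by the top-8 features.
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(4)]
)

h = torch.randn(512, 64)                      # toy inputs
for depth, layer in enumerate(layers, start=1):
    h = layer(h)
    var = h.var(dim=0)                        # per-feature variance
    share = (var.topk(8).values.sum() / var.sum()).item()
    print(f"layer {depth}: top-8 variance share = {share:.2f}")
```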
arXiv Detail & Related papers (2024-03-18T00:48:58Z)
- Diffused Redundancy in Pre-trained Representations [98.55546694886819]
We take a closer look at how features are encoded in pre-trained representations.
We find that learned representations in a given layer exhibit a degree of diffuse redundancy.
Our findings shed light on the nature of representations learned by pre-trained deep neural networks.
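A hedged sketch of one way to expose such redundancy: compare a linear probe trained on all units of a frozen representation against one trained on a small random subset of units. The synthetic features and probe settings are placeholders, not the paper's protocol.
```python
# Hedged sketch: probe accuracy from all units vs. a random subset of units.
# Synthetic features stand in for a frozen pre-trained representation.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, dim, n_classes, keep = 2048, 512, 10, 64
feats = torch.randn(n, dim)
labels = torch.randint(0, n_classes, (n,))

def train_probe(x, y, epochs=50):
    probe = nn.Linear(x.shape[1], n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(probe(x), y).backward()
        opt.step()
    return (probe(x).argmax(dim=1) == y).float().mean().item()

subset = torch.randperm(dim)[:keep]           # keep 64 of 512 units at random
print("all units:", train_probe(feats, labels))
print("random subset:", train_probe(feats[:, subset], labels))
```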
arXiv Detail & Related papers (2023-05-31T21:00:50Z)
- A surprisingly simple technique to control the pretraining bias for better transfer: Expand or Narrow your representation [22.866948071297767]
Self-Supervised Learning (SSL) models rely on a pretext task to learn representations.
We show that merely changing the dimensionality of the representation -- by changing only the size of the backbone's very last block -- is a remarkably effective technique to mitigate the pretraining bias.
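A minimal sketch of the knob being described, assuming a toy convolutional backbone rather than the paper's architecture: only the width of the very last block changes, which expands or narrows the representation handed to downstream tasks.
```python
# Toy backbone where only the last block's width is a knob, expanding or
# narrowing the output representation without touching earlier layers.
# The architecture is an illustrative assumption.
import torch
import torch.nn as nn

def make_backbone(last_width: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(128, last_width, 3, stride=2, padding=1), nn.ReLU(),  # last block
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

x = torch.randn(4, 3, 64, 64)
print(make_backbone(256)(x).shape)    # narrow: torch.Size([4, 256])
print(make_backbone(2048)(x).shape)   # expanded: torch.Size([4, 2048])
```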
arXiv Detail & Related papers (2023-04-11T17:24:29Z)
- Understanding and Improving the Role of Projection Head in Self-Supervised Learning [77.59320917894043]
Self-supervised learning (SSL) aims to produce useful feature representations without access to human-labeled data annotations.
Current contrastive learning approaches append a parametrized projection head to the end of some backbone network to optimize the InfoNCE objective.
This raises a fundamental question: Why is a learnable projection head required if we are to discard it after training?
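For reference, a simplified sketch of that standard setup: an encoder followed by a parametrized projection head, trained with an InfoNCE-style loss on the head's outputs. Dimensions, temperature, and the random data are illustrative assumptions.
```python
# Simplified contrastive setup: projection head on top of an encoder,
# InfoNCE-style loss on the head's outputs; the head is discarded after training.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # z1, z2: projections of two augmented views of the same batch, shape (N, d).
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # cross-view similarities (N, N)
    targets = torch.arange(z1.size(0))        # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())
projection_head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

view1, view2 = torch.randn(32, 3, 32, 32), torch.randn(32, 3, 32, 32)
loss = info_nce(projection_head(encoder(view1)), projection_head(encoder(view2)))
loss.backward()
# Downstream use relies on encoder(x); the projection head is thrown away.
```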
arXiv Detail & Related papers (2022-12-22T05:42:54Z)
- Effective Self-supervised Pre-training on Low-compute Networks without Distillation [6.530011859253459]
Reported performance of self-supervised learning has trailed behind standard supervised pre-training by a large margin.
Most prior works attribute this poor performance to the capacity bottleneck of the low-compute networks.
We take a closer look at the detrimental factors causing these practical limitations, and at whether they are intrinsic to the self-supervised low-compute setting.
arXiv Detail & Related papers (2022-10-06T10:38:07Z)
- TSG: Target-Selective Gradient Backprop for Probing CNN Visual Saliency [72.9106103283475]
We study visual saliency, a.k.a. visual explanation, to interpret convolutional neural networks.
Inspired by our observations, we propose a novel visual saliency framework, termed Target-Selective Gradient (TSG) backprop.
The proposed TSG consists of two components, namely, TSG-Conv and TSG-FC, which rectify the gradients for convolutional layers and fully-connected layers, respectively.
arXiv Detail & Related papers (2021-10-11T12:00:20Z)
- How Self-Supervised Learning Can be Used for Fine-Grained Head Pose Estimation? [2.0625936401496237]
We try to answer the question: how can SSL be used for head pose estimation?
Modified versions of jigsaw puzzling and rotation are used as SSL pretext tasks.
The error rate is reduced by up to 11% with the HTML method compared to SL.
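A minimal sketch of a rotation-prediction pretext task of the kind mentioned above, with a tiny network and random tensors standing in for face crops (illustrative assumptions, not the paper's setup):
```python
# Rotation-prediction pretext task: rotate each image by 0/90/180/270 degrees
# and train the network to predict which rotation was applied.
# The tiny CNN and random "images" are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 4),                          # one class per rotation
)

images = torch.randn(16, 3, 64, 64)            # stand-in for face crops
rot_labels = torch.randint(0, 4, (16,))
rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                       for img, k in zip(images, rot_labels)])
loss = F.cross_entropy(net(rotated), rot_labels)
loss.backward()
```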
arXiv Detail & Related papers (2021-08-10T19:34:45Z)
- Semantic Drift Compensation for Class-Incremental Learning [48.749630494026086]
Class-incremental learning of deep networks sequentially increases the number of classes to be classified.
We propose a new method to estimate the drift, called semantic drift, of features and compensate for it without the need for any exemplars.
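A hedged sketch of the general idea, assuming the commonly used Gaussian-weighted formulation (details may differ from the paper): the drift of an old class prototype is estimated from the drift that current-task features undergo between the old and the updated model.
```python
# Hedged sketch of semantic drift compensation: estimate how an old class
# prototype moves by averaging the drift of nearby current-task features,
# measured before vs. after training on the new task. Sigma and the toy
# tensors are illustrative assumptions.
import torch

torch.manual_seed(0)
d, n, sigma = 128, 1000, 1.0
proto_old = torch.randn(d)                      # saved prototype, no exemplars kept
z_before = torch.randn(n, d)                    # current-task features, old model
z_after = z_before + 0.1 * torch.randn(n, d)    # same samples, updated model

delta = z_after - z_before                                  # per-sample drift
dist2 = (z_before - proto_old).pow(2).sum(dim=1)            # distance to prototype
w = torch.exp(-dist2 / (2 * sigma ** 2))                    # closer samples weigh more
proto_compensated = proto_old + (w.unsqueeze(1) * delta).sum(dim=0) / w.sum()
```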
arXiv Detail & Related papers (2020-04-01T13:31:19Z)
- TAFSSL: Task-Adaptive Feature Sub-Space Learning for few-shot classification [50.358839666165764]
We show that the Task-Adaptive Feature Sub-Space Learning (TAFSSL) can significantly boost the performance in Few-Shot Learning scenarios.
Specifically, we show that on the challenging miniImageNet and tieredImageNet benchmarks, TAFSSL can improve the current state-of-the-art in both transductive and semi-supervised FSL settings by more than 5%.
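A hedged sketch in the spirit of this summary, assuming a PCA-style task-adaptive subspace followed by nearest-prototype classification (toy dimensions and random features):
```python
# Hedged sketch of a task-adaptive feature subspace for few-shot classification:
# fit principal directions on the pooled support + query features of the task,
# project into that subspace, then classify queries by nearest class prototype.
# Dimensions and random features are illustrative assumptions.
import torch

torch.manual_seed(0)
n_way, n_shot, n_query, d, k = 5, 5, 15, 512, 10
support = torch.randn(n_way, n_shot, d)         # labeled support features
query = torch.randn(n_way * n_query, d)         # unlabeled query features

pooled = torch.cat([support.reshape(-1, d), query], dim=0)
mean = pooled.mean(dim=0, keepdim=True)
_, _, vh = torch.linalg.svd(pooled - mean, full_matrices=False)
basis = vh[:k].t()                               # (d, k) task-adaptive subspace

project = lambda x: (x - mean) @ basis
prototypes = project(support.reshape(-1, d)).reshape(n_way, n_shot, k).mean(dim=1)
preds = torch.cdist(project(query), prototypes).argmin(dim=1)
```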
arXiv Detail & Related papers (2020-03-14T16:59:17Z)