Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation
- URL: http://arxiv.org/abs/2205.14141v1
- Date: Fri, 27 May 2022 17:59:36 GMT
- Title: Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation
- Authors: Yixuan Wei and Han Hu and Zhenda Xie and Zheng Zhang and Yue Cao and
Jianmin Bao and Dong Chen and Baining Guo
- Abstract summary: Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances.
In this paper, we show that the inferior fine-tuning performance of pre-training approaches can be significantly improved by a simple post-processing.
- Score: 42.37533586611174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked image modeling (MIM) learns representations with remarkably good
fine-tuning performances, overshadowing previous prevalent pre-training
approaches such as image classification, instance contrastive learning, and
image-text alignment. In this paper, we show that the inferior fine-tuning
performance of these pre-training approaches can be significantly improved by a
simple post-processing step in the form of feature distillation (FD). Feature
distillation converts the old representations into new representations that
have a few desirable properties, just like the representations produced by MIM.
These properties, which we collectively refer to as optimization friendliness,
are identified and analyzed with a set of attention- and optimization-related
diagnostic tools. With these properties, the new representations show strong
fine-tuning performance. Specifically, contrastive self-supervised learning
methods become as competitive in fine-tuning as the state-of-the-art masked
image modeling (MIM) algorithms. The fine-tuning performance of CLIP models is
also significantly improved, with a CLIP ViT-L model reaching 89.0% top-1
accuracy on ImageNet-1K classification. More importantly, our work provides a
way for future research to focus more effort on the generality and scalability
of the learnt representations without being preoccupied with optimization
friendliness, since it can be enhanced rather easily. The code will be
available at https://github.com/SwinTransformer/Feature-Distillation.
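The feature-distillation recipe described above can be sketched roughly as follows. This is a hypothetical PyTorch illustration rather than the released implementation; the teacher/student interfaces, the whitening step, and the smooth-L1 loss are assumptions about the setup.

```python
import torch
import torch.nn.functional as F

def whiten(feats, eps=1e-6):
    # Per-token whitening (LayerNorm without affine parameters); assumed here as
    # the normalization applied to the teacher's features before distillation.
    return (feats - feats.mean(dim=-1, keepdim=True)) / (feats.std(dim=-1, keepdim=True) + eps)

def feature_distillation_step(student, teacher, images, optimizer):
    """One FD step: a frozen pre-trained teacher (e.g. a contrastive or CLIP model)
    provides target features and only the student is updated to reproduce them."""
    with torch.no_grad():
        targets = whiten(teacher(images))      # [B, N, D] teacher token features (assumed shape)
    preds = student(images)                    # [B, N, D] student token features
    loss = F.smooth_l1_loss(preds, targets)    # robust regression loss on the features
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The fine-tuning that follows this post-processing then starts from the student's weights instead of the original pre-trained checkpoint.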
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z) - Fine-tuning a Multiple Instance Learning Feature Extractor with Masked
Context Modelling and Knowledge Distillation [0.21756081703275998]
We propose to increase downstream MIL classification performance by fine-tuning the feature extractor model using Masked Context Modelling with Knowledge Distillation.
A single epoch of the proposed task suffices to increase the downstream performance of the feature-extractor model when used in a MIL scenario, while being considerably smaller and requiring a fraction of its compute.
arXiv Detail & Related papers (2024-03-08T14:04:30Z) - Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051]
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets.
We propose a novel and effective framework based on learning Visual Prompts (VPT) in pre-trained Vision Transformers (ViT).
We demonstrate that our new approximations with semantic information have superior representative capabilities.
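As background for the visual-prompt component mentioned above, a minimal prompt-tuning sketch for a frozen ViT might look like the following (hypothetical PyTorch code; the block interface, prompt count, and pooling are assumptions, and the paper's semantic-proxy learning is not shown).

```python
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    """Minimal visual-prompt-tuning sketch: learnable prompt tokens are prepended to
    the patch tokens of a frozen ViT, and only the prompts are trained."""
    def __init__(self, blocks, embed_dim=768, num_prompts=10):
        super().__init__()
        self.blocks = blocks                       # frozen transformer blocks (nn.ModuleList)
        for p in self.blocks.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)

    def forward(self, patch_tokens):               # patch_tokens: [B, N, D]
        b = patch_tokens.size(0)
        x = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        for blk in self.blocks:
            x = blk(x)
        # drop the prompt positions and mean-pool the patch tokens as the embedding
        return x[:, self.prompts.size(1):].mean(dim=1)
```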
arXiv Detail & Related papers (2024-02-04T04:42:05Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the greedy data needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - Improving Visual Representation Learning through Perceptual
Understanding [0.0]
We present an extension to masked autoencoders (MAE) which improves on the representations learnt by the model by explicitly encouraging the learning of higher scene-level features.
We achieve 78.1% top-1 accuracy linear probing on ImageNet-1K and up to 88.1% when fine-tuning, with similar results for other downstream tasks.
arXiv Detail & Related papers (2022-12-30T00:59:46Z) - SAGE: Saliency-Guided Mixup with Optimal Rearrangements [22.112463794733188]
We introduce Saliency-Guided Mixup with Optimal Rearrangements (SAGE).
SAGE creates new training examples by rearranging and mixing image pairs using visual saliency as guidance.
We demonstrate on CIFAR-10 and CIFAR-100 that SAGE achieves better or comparable performance to the state of the art while being more efficient.
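A much-simplified version of saliency-guided mixing (not SAGE's exact algorithm, which also optimally rearranges the pair before mixing) could look like the following sketch, where the most salient pixels of one image replace pixels of the other and labels are mixed in the same proportion.

```python
import torch

def saliency_guided_mix(img_a, img_b, sal_b, label_a, label_b, keep_ratio=0.5):
    """Mix an image pair guided by a saliency map of img_b (e.g. from input gradients).
    img_a, img_b: [B, 3, H, W]; sal_b: [B, H, W]; label_a, label_b: one-hot [B, C]."""
    b, _, h, w = img_a.shape
    k = max(int(keep_ratio * h * w), 1)
    thresh = sal_b.flatten(1).topk(k, dim=1).values[:, -1]        # per-image saliency threshold
    mask = (sal_b >= thresh.view(b, 1, 1)).float().unsqueeze(1)   # [B, 1, H, W] keep-from-b mask
    mixed = mask * img_b + (1.0 - mask) * img_a
    lam = mask.mean(dim=(1, 2, 3)).view(-1, 1)                    # fraction of pixels taken from img_b
    return mixed, lam * label_b + (1.0 - lam) * label_a
```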
arXiv Detail & Related papers (2022-10-31T19:45:21Z) - Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language
Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
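The first-order, single-level update that this summary contrasts with MAML's bi-level optimization can be sketched as a plain multitask fine-tuning step (illustrative only; the task sampling and loss are assumptions).

```python
def multitask_finetune_step(model, loss_fn, task_batches, optimizer):
    """Accumulate the losses of several tasks and take one ordinary gradient step,
    with no inner-loop adaptation and no second-order terms as in MAML."""
    optimizer.zero_grad()
    total = 0.0
    for inputs, targets in task_batches:       # one (inputs, targets) batch per sampled task
        loss = loss_fn(model(inputs), targets)
        loss.backward()                        # plain first-order gradients
        total += loss.item()
    optimizer.step()
    return total / len(task_batches)
```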
arXiv Detail & Related papers (2022-03-09T17:26:53Z) - Image-specific Convolutional Kernel Modulation for Single Image
Super-resolution [85.09413241502209]
We propose a novel image-specific convolutional kernel modulation (IKM) approach.
We exploit the global contextual information of the image or feature to generate an attention weight for adaptively modulating the convolutional kernels.
Experiments on single image super-resolution show that the proposed method achieves superior performance over state-of-the-art methods.
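A rough PyTorch sketch of the kernel-modulation idea, simplified from the paper's IKM module: global context pooled from the input generates per-channel weights that rescale a shared convolution kernel for each image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelModulatedConv(nn.Module):
    """Each image's globally pooled context produces per-output-channel weights that
    rescale a shared convolution kernel, so every image is filtered by its own kernel."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.ctx = nn.Sequential(nn.Linear(in_ch, out_ch), nn.Sigmoid())
        self.pad = k // 2

    def forward(self, x):                                   # x: [B, C_in, H, W]
        b = x.size(0)
        attn = self.ctx(x.mean(dim=(2, 3)))                 # [B, C_out] image-specific weights
        w = self.weight.unsqueeze(0) * attn.view(b, -1, 1, 1, 1)   # [B, C_out, C_in, k, k]
        # grouped-conv trick: apply each image's own kernel in a single conv2d call
        out = F.conv2d(x.reshape(1, -1, *x.shape[2:]),
                       w.reshape(-1, *self.weight.shape[1:]),
                       padding=self.pad, groups=b)
        return out.reshape(b, -1, *out.shape[2:])
```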
arXiv Detail & Related papers (2021-11-16T11:05:10Z) - When Vision Transformers Outperform ResNets without Pretraining or
Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers through the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and smoothness when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z) - FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning [64.32306537419498]
We propose a novel learned feature-based refinement and augmentation method that produces a varied set of complex transformations.
These transformations also use information from both within-class and across-class representations that we extract through clustering.
We demonstrate that our method is comparable to the current state of the art for smaller datasets while being able to scale up to larger datasets.
arXiv Detail & Related papers (2020-07-16T17:55:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.