Visual Representation Learning with Self-Supervised Attention for
Low-Label High-data Regime
- URL: http://arxiv.org/abs/2201.08951v1
- Date: Sat, 22 Jan 2022 02:37:07 GMT
- Title: Visual Representation Learning with Self-Supervised Attention for
Low-Label High-data Regime
- Authors: Prarthana Bhattacharyya, Chenge Li, Xiaonan Zhao, Istv\'an
Feh\'erv\'ari and Jason Sun
- Abstract summary: Self-supervised vision transformers (SSL-ViTs) can be adapted to two important computer vision tasks in the low-label, high-data regime.
For few-shot image classification we train SSL-ViTs without any supervision, on external data, and use this trained embedder to adapt quickly to novel classes with limited number of labels.
For zero-shot image retrieval, we use SSL-ViTs pre-trained on a large dataset without any labels and fine-tune them with several metric learning objectives.
- Score: 0.41998444721319217
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervision has shown outstanding results for natural language
processing, and more recently, for image recognition. Simultaneously, vision
transformers and its variants have emerged as a promising and scalable
alternative to convolutions on various computer vision tasks. In this paper, we
are the first to question if self-supervised vision transformers (SSL-ViTs) can
be adapted to two important computer vision tasks in the low-label, high-data
regime: few-shot image classification and zero-shot image retrieval. The
motivation is to reduce the number of manual annotations required to train a
visual embedder, and to produce generalizable, semantically meaningful and
robust embeddings. For few-shot image classification we train SSL-ViTs without
any supervision, on external data, and use this trained embedder to adapt
quickly to novel classes with limited number of labels. For zero-shot image
retrieval, we use SSL-ViTs pre-trained on a large dataset without any labels
and fine-tune them with several metric learning objectives. Our self-supervised
attention representations outperforms the state-of-the-art on several public
benchmarks for both tasks, namely miniImageNet and CUB200 for few-shot image
classification by up-to 6%-10%, and Stanford Online Products, Cars196 and
CUB200 for zero-shot image retrieval by up-to 4%-11%. Code is available at
\url{https://github.com/AutoVision-cloud/SSL-ViT-lowlabel-highdata}.
Related papers
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning, that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z) - CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z) - MULLER: Multilayer Laplacian Resizer for Vision [16.67232499096539]
We present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed MULLER resizer.
We show that MULLER can be easily plugged into various training pipelines, and it effectively boosts the performance of the underlying vision task with little to no extra cost.
arXiv Detail & Related papers (2023-04-06T04:39:21Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complimentary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z) - Unified Contrastive Learning in Image-Text-Label Space [130.31947133453406]
Unified Contrastive Learning (UniCL) is effective way of learning semantically rich yet discriminative representations.
UniCL stand-alone is a good learner on pure imagelabel data, rivaling the supervised learning methods across three image classification datasets.
arXiv Detail & Related papers (2022-04-07T17:34:51Z) - BLIP: Bootstrapping Language-Image Pre-training for Unified
Vision-Language Understanding and Generation [86.4572981982407]
We propose BLIP, a new vision-language framework which transfers flexibly to both vision-language understanding and generation tasks.
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones.
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
arXiv Detail & Related papers (2022-01-28T12:49:48Z) - Multi-label Iterated Learning for Image Classification with Label
Ambiguity [3.5736176624479654]
We propose multi-label iterated learning (MILe) to incorporate the inductive biases of multi-label learning from single labels.
MILe is a simple yet effective procedure that builds a multi-label description of the image by propagating binary predictions.
We show that MILe is effective reducing label noise, achieving state-of-the-art performance on real-world large-scale noisy data such as WebVision.
arXiv Detail & Related papers (2021-11-23T22:10:00Z) - Learning Transferable Visual Models From Natural Language Supervision [13.866297967166089]
Learning directly from raw text about images is a promising alternative.
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn.
SOTA image representations are learned from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
arXiv Detail & Related papers (2021-02-26T19:04:58Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.