Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation
- URL: http://arxiv.org/abs/2307.03407v1
- Date: Fri, 7 Jul 2023 06:16:43 GMT
- Title: Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation
- Authors: Dahyun Kang, Piotr Koniusz, Minsu Cho, Naila Murray
- Abstract summary: We address the task of weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT).
Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions.
Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings.
- Score: 58.03255076119459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the task of weakly-supervised few-shot image classification and
segmentation, by leveraging a Vision Transformer (ViT) pretrained with
self-supervision. Our proposed method takes token representations from the
self-supervised ViT and leverages their correlations, via self-attention, to
produce classification and segmentation predictions through separate task
heads. Our model is able to effectively learn to perform classification and
segmentation in the absence of pixel-level labels during training, using only
image-level labels. To do this, it uses attention maps, created from tokens
generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We
also explore a practical setup with "mixed" supervision, where a small number
of training images contain ground-truth pixel-level labels and the remaining
images have only image-level labels. For this mixed setup, we propose to
improve the pseudo-labels using a pseudo-label enhancer that was trained using
the available ground-truth pixel-level labels. Experiments on Pascal-5i and
COCO-20i demonstrate significant performance gains in a variety of supervision
settings, and in particular when little-to-no pixel-level labels are available.
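The pipeline in the abstract is concrete enough to sketch. The following is a minimal, hypothetical PyTorch sketch of the described structure: frozen self-supervised ViT tokens, self-attention over their correlations, and separate classification and segmentation heads. Module names, dimensions, and the binary segmentation head are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class TokenCorrelationHead(nn.Module):
    """Hypothetical sketch: tokens from a frozen self-supervised ViT are
    correlated via self-attention, then routed to separate task heads."""

    def __init__(self, dim=384, num_heads=6, num_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes)  # image-level logits
        self.seg_head = nn.Linear(dim, 2)            # per-token fg/bg logits

    def forward(self, tokens):
        # tokens: (B, N, dim) patch embeddings from the frozen backbone
        attended, _ = self.attn(tokens, tokens, tokens)
        cls_logits = self.cls_head(attended.mean(dim=1))  # pool tokens, classify
        seg_logits = self.seg_head(attended)              # per-token, segment
        return cls_logits, seg_logits

# With image-level labels only, the segmentation head would be trained
# against pseudo-labels derived from the backbone's attention maps.
model = TokenCorrelationHead()
tokens = torch.randn(4, 196, 384)  # e.g., 14x14 patches from a ViT-S/16
cls_logits, seg_logits = model(tokens)
print(cls_logits.shape, seg_logits.shape)  # (4, 2) (4, 196, 2)
```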
Related papers
- Label Filling via Mixed Supervision for Medical Image Segmentation from Noisy Annotations [22.910649758574852]
We propose a simple yet effective Label Filling framework, termed LF-Net.
It predicts the ground-truth segmentation label given only noisy annotations during training.
Results on five datasets show that our LF-Net boosts segmentation accuracy in all datasets compared with state-of-the-art methods.
arXiv Detail & Related papers (2024-10-21T14:36:36Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have been shown to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens and maintain a label for each token (a toy sketch of this bookkeeping appears after the list).
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- PLMCL: Partial-Label Momentum Curriculum Learning for Multi-Label Image Classification [25.451065364433028]
Multi-label image classification aims to predict all possible labels in an image.
Existing works on partial-label learning focus on the case where each training image is annotated with only a subset of its labels.
This paper proposes a new partial-label setting in which only a subset of the training images are labeled, each with only one positive label, while the rest of the training images remain unlabeled.
arXiv Detail & Related papers (2022-08-22T01:23:08Z)
- Pseudo-label Guided Cross-video Pixel Contrast for Robotic Surgical Scene Segmentation with Limited Annotations [72.15956198507281]
We propose PGV-CL, a novel pseudo-label-guided cross-video contrastive learning method to boost scene segmentation.
We extensively evaluate our method on the public robotic surgery dataset EndoVis18 and the public cataract dataset CaDIS.
arXiv Detail & Related papers (2022-07-20T05:42:19Z)
- Multiple Instance Learning with Mixed Supervision in Gleason Grading [19.314029297579577]
We propose a mixed supervision Transformer based on the multiple instance learning framework.
The model utilizes both slide-level and instance-level labels to achieve more accurate Gleason grading at the slide level.
We achieve state-of-the-art performance on the SICAPv2 dataset, and visual analysis shows accurate instance-level predictions.
arXiv Detail & Related papers (2022-06-26T06:28:44Z)
- Mixed Supervision Learning for Whole Slide Image Classification [88.31842052998319]
We propose a mixed supervision learning framework for super high-resolution images.
During the patch training stage, this framework can make use of coarse image-level labels to refine self-supervised learning.
A comprehensive strategy is proposed to suppress pixel-level false positives and false negatives.
arXiv Detail & Related papers (2021-07-02T09:46:06Z)
- A Closer Look at Self-training for Zero-Label Semantic Segmentation [53.4488444382874]
Being able to segment unseen classes not observed during training is an important technical challenge in deep learning.
Prior zero-label semantic segmentation works approach this task by learning visual-semantic embeddings or generative models.
We propose a consistency regularizer that filters out noisy pseudo-labels by taking the intersection of the pseudo-labels generated from different augmentations of the same image (see the agreement sketch after the list).
arXiv Detail & Related papers (2021-04-21T14:34:33Z)
- Learning from Pixel-Level Label Noise: A New Perspective for Semi-Supervised Semantic Segmentation [12.937770890847819]
We propose a graph based label noise detection and correction framework to deal with pixel-level noisy labels.
In particular, for pixel-level noisy labels generated from weak supervision via Class Activation Maps (CAM), we train a clean segmentation model with strong supervision.
Finally, we adopt a superpixel-based graph to represent the relations of spatial adjacency and semantic similarity between pixels in one image.
arXiv Detail & Related papers (2021-03-26T03:23:21Z)
- General Multi-label Image Classification with Transformers [30.58248625606648]
We propose the Classification Transformer (C-Tran) to exploit the complex dependencies among visual features and labels.
A key ingredient of our method is a label mask training objective that uses a ternary encoding scheme to represent the state of the labels (a toy sketch of such an encoding appears after the list).
Our model shows state-of-the-art performance on challenging datasets such as COCO and Visual Genome.
arXiv Detail & Related papers (2020-11-27T23:20:35Z)
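For the Token-Label Alignment entry above, the core bookkeeping (keeping a per-token label consistent with each token's actual content as tokens are mixed) can be illustrated in a few lines. This is a hypothetical sketch under a CutMix-style token mask, not the TL-Align implementation; the function name and one-hot label layout are assumptions.

```python
import torch

def mix_token_labels(labels_a, labels_b, token_mask):
    """Hypothetical token-level label bookkeeping for CutMix-style mixing:
    tokens copied from image B (token_mask == True) inherit B's labels,
    so every token keeps a label matching its actual content.
    labels_a, labels_b: (N, C) per-token one-hot labels; token_mask: (N,) bool."""
    return torch.where(token_mask.unsqueeze(-1), labels_b, labels_a)

# Example: 4 tokens, 3 classes; tokens 2-3 come from image B.
labels_a = torch.eye(3)[torch.tensor([0, 0, 0, 0])]
labels_b = torch.eye(3)[torch.tensor([2, 2, 2, 2])]
mask = torch.tensor([False, False, True, True])
print(mix_token_labels(labels_a, labels_b, mask))
```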
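For the zero-label self-training entry, the consistency regularizer reduces to an agreement rule between the pseudo-labels of two augmented views. Below is a toy sketch; the ignore-index convention for disagreeing pixels is an assumption.

```python
import torch

def intersect_pseudo_labels(pred_a, pred_b, ignore_index=255):
    """Keep a pixel's pseudo-label only where class predictions from two
    augmentations of the same image agree; disagreements are ignored.
    pred_a, pred_b: (H, W) integer class maps aligned to the same geometry."""
    agree = pred_a == pred_b
    return torch.where(agree, pred_a, torch.full_like(pred_a, ignore_index))

# Example: disagreeing pixels receive the ignore index (excluded from the loss).
a = torch.tensor([[1, 2], [0, 3]])
b = torch.tensor([[1, 0], [0, 3]])
print(intersect_pseudo_labels(a, b))  # [[1, 255], [0, 3]]
```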
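For the C-Tran entry, a label-mask objective needs a way to mark each label as known-positive, known-negative, or unknown. The sketch below uses a {-1, 0, +1} coding as an illustrative assumption; the paper's exact encoding may differ.

```python
import torch

def ternary_label_states(labels, masked):
    """Hypothetical ternary encoding for label-mask training:
    +1 = known positive, -1 = known negative, 0 = unknown (masked out).
    labels: (B, L) binary ground truth; masked: (B, L) bool, True = hidden."""
    states = labels.float() * 2.0 - 1.0  # {0, 1} -> {-1, +1}
    return torch.where(masked, torch.zeros_like(states), states)

# Example: 4 labels, the last two hidden during training.
labels = torch.tensor([[1, 0, 1, 0]])
masked = torch.tensor([[False, False, True, True]])
print(ternary_label_states(labels, masked))  # [[ 1., -1., 0., 0.]]
```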