Fully Attentional Networks with Self-emerging Token Labeling
- URL: http://arxiv.org/abs/2401.03844v1
- Date: Mon, 8 Jan 2024 12:14:15 GMT
- Title: Fully Attentional Networks with Self-emerging Token Labeling
- Authors: Bingyin Zhao, Zhiding Yu, Shiyi Lan, Yutao Cheng, Anima Anandkumar,
Yingjie Lao, Jose M. Alvarez
- Abstract summary: We train a FAN token labeler (FAN-TL) to generate semantically meaningful patch token labels, followed by a FAN student model training stage that uses both the token labels and the original class label.
With the proposed STL framework, our best model achieves 84.8% Top-1 accuracy and 42.1% mCE on ImageNet-1K and ImageNet-C, and sets a new state-of-the-art for ImageNet-A (46.1%) and ImageNet-R (56.6%) without using extra data.
- Score: 108.53230681047617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies indicate that Vision Transformers (ViTs) are robust against
out-of-distribution scenarios. In particular, the Fully Attentional Network
(FAN), a family of ViT backbones, has achieved state-of-the-art robustness. In
this paper, we revisit the FAN models and improve their pre-training with a
self-emerging token labeling (STL) framework. Our method consists of a two-stage
training framework. Specifically, we first train a FAN token labeler (FAN-TL)
to generate semantically meaningful patch token labels, followed by a FAN
student model training stage that uses both the token labels and the original
class label. With the proposed STL framework, our best model based on
FAN-L-Hybrid (77.3M parameters) achieves 84.8% Top-1 accuracy and 42.1% mCE on
ImageNet-1K and ImageNet-C, and sets a new state-of-the-art for ImageNet-A
(46.1%) and ImageNet-R (56.6%) without using extra data, outperforming the
original FAN counterpart by significant margins. The proposed framework also
demonstrates significantly enhanced performance on downstream tasks such as
semantic segmentation, with up to 1.7% improvement in robustness over the
counterpart model. Code is available at https://github.com/NVlabs/STL.
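To make the two-stage recipe above concrete, here is a minimal sketch: a FAN token labeler (FAN-TL) is first trained with the image-level label, then the frozen FAN-TL supplies patch-token labels that, together with the class label, supervise the FAN student. The module interface (each model returning class logits and patch-token logits), the KL-based token loss, and the 0.5 weighting are illustrative assumptions, not the released NVlabs/STL implementation.

```python
# Sketch of a two-stage self-emerging token labeling (STL) recipe.
# Names and the loss weighting are illustrative placeholders.
import torch
import torch.nn.functional as F

def train_token_labeler(labeler, loader, opt):
    """Stage 1: train the FAN token labeler (FAN-TL) with the image-level class label."""
    labeler.train()
    for images, targets in loader:
        cls_logits, _ = labeler(images)              # (B, num_classes); patch tokens unused here
        loss = F.cross_entropy(cls_logits, targets)
        opt.zero_grad(); loss.backward(); opt.step()

def train_student(student, labeler, loader, opt, token_weight=0.5):
    """Stage 2: train the FAN student with the class label plus the patch-token
    labels emitted by the frozen FAN-TL."""
    labeler.eval()
    student.train()
    for images, targets in loader:
        with torch.no_grad():
            _, token_labels = labeler(images)        # (B, N, num_classes) patch-token logits
        cls_logits, token_logits = student(images)
        cls_loss = F.cross_entropy(cls_logits, targets)
        token_loss = F.kl_div(                       # distill patch tokens toward FAN-TL labels
            F.log_softmax(token_logits, dim=-1),
            F.softmax(token_labels, dim=-1),
            reduction="batchmean",
        )
        loss = cls_loss + token_weight * token_loss
        opt.zero_grad(); loss.backward(); opt.step()
```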
Related papers
- SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers [20.045277771042787]
Vision transformers (ViTs) have consistently demonstrated remarkable performance across various visual recognition tasks.
We introduce a novel approach named Spatial Autocorrelation Token Analysis (SATA) to enhance ViT robustness.
SATA seamlessly integrates into existing ViT baselines without requiring retraining or additional fine-tuning.
arXiv Detail & Related papers (2024-09-30T01:18:40Z) - Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z) - Target-aware Bi-Transformer for Few-shot Segmentation [4.3753381458828695]
Few-shot semantic segmentation (FSS) aims to use limited labeled support images to identify the segmentation of new classes of objects.
In this paper, we propose the Target-aware Bi-Transformer Network (TBTNet) to treat support images and the query image equivalently.
A vigorous Target-aware Transformer Layer (TTL) is also designed to distill correlations and force the model to focus on foreground information.
arXiv Detail & Related papers (2023-09-18T05:28:51Z) - Prompt Tuning for Parameter-efficient Medical Image Segmentation [79.09285179181225]
We propose and investigate several contributions to achieve a parameter-efficient but effective adaptation for semantic segmentation on two medical imaging datasets.
We pre-train this architecture with a dedicated dense self-supervision scheme based on assignments to online-generated prototypes.
We demonstrate that the resulting neural network model is able to attenuate the gap between fully fine-tuned and parameter-efficiently adapted models.
arXiv Detail & Related papers (2022-11-16T21:55:05Z) - Elastic Weight Consolidation Improves the Robustness of Self-Supervised
Learning Methods under Transfer [4.2141621237414615]
Self-supervised representation learning (SSL) methods provide an effective label-free initial condition for fine-tuning on downstream tasks.
We re-interpret SSL fine-tuning under the lens of Bayesian continual learning and consider regularization through the Elastic Weight Consolidation (EWC) framework.
We demonstrate that self-regularization against an initial SSL backbone improves worst sub-group performance in Waterbirds by 5% and Celeb-A by 2%.
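The EWC regularizer referenced here is the standard quadratic penalty that anchors the fine-tuned weights to the initial (SSL pre-trained) weights, scaled by a diagonal Fisher estimate. The sketch below is a generic formulation; the parameter snapshots, the Fisher computation, and the strength lam are placeholders rather than the paper's exact settings.

```python
import torch

def ewc_penalty(model, init_params, fisher, lam=1.0):
    """Elastic Weight Consolidation penalty: 0.5 * lam * sum_i F_i * (theta_i - theta*_i)^2.
    init_params and fisher are dicts keyed by parameter name, captured from the
    initial SSL backbone; lam is an assumed regularization strength."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - init_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During fine-tuning, the total objective would be:
# loss = task_loss + ewc_penalty(model, init_params, fisher)
```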
arXiv Detail & Related papers (2022-10-28T19:00:25Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z) - Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition [98.25592165484737]
We propose a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL).
CMPL achieves 17.6% and 25.1% Top-1 accuracy on Kinetics-400 and UCF-101, respectively, using only the RGB modality and 1% labeled data.
arXiv Detail & Related papers (2021-12-17T18:59:41Z) - VOLO: Vision Outlooker for Visual Recognition [148.12522298731807]
Vision transformers (ViTs) have shown the great potential of self-attention-based models in ImageNet classification.
We introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO).
Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention efficiently encodes finer-level features and contexts into tokens.
Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, which is the first model exceeding 87% accuracy on this competitive benchmark.
arXiv Detail & Related papers (2021-06-24T15:46:54Z) - FixMatch: Simplifying Semi-Supervised Learning with Consistency and
Confidence [93.91751021370638]
Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance.
In this paper, we demonstrate the power of a simple combination of two common SSL methods: consistency regularization and pseudo-labeling.
Our algorithm, FixMatch, first generates pseudo-labels using the model's predictions on weakly-augmented unlabeled images.
arXiv Detail & Related papers (2020-01-21T18:32:27Z)
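As a compact illustration of the mechanism summarized in the FixMatch entry above, the sketch below computes the unlabeled-data loss: hard pseudo-labels come from predictions on weakly-augmented images and, when confident enough, supervise the predictions on strongly-augmented views of the same images. The model interface is assumed; the 0.95 threshold follows the value commonly reported for FixMatch.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, weak_imgs, strong_imgs, threshold=0.95):
    """Consistency regularization + pseudo-labeling on an unlabeled batch.
    weak_imgs / strong_imgs are weakly and strongly augmented views of the same images."""
    with torch.no_grad():
        probs = F.softmax(model(weak_imgs), dim=-1)   # predictions on weak views
        conf, pseudo = probs.max(dim=-1)              # hard pseudo-labels and their confidence
        mask = (conf >= threshold).float()            # keep only confident pseudo-labels
    logits_strong = model(strong_imgs)
    per_sample = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (mask * per_sample).mean()
```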