Optimizing Relevance Maps of Vision Transformers Improves Robustness
- URL: http://arxiv.org/abs/2206.01161v1
- Date: Thu, 2 Jun 2022 17:24:48 GMT
- Title: Optimizing Relevance Maps of Vision Transformers Improves Robustness
- Authors: Hila Chefer, Idan Schwartz, Lior Wolf
- Abstract summary: It has been observed that visual classification models often rely mostly on the image background, neglecting the foreground, which hurts their robustness to distribution changes.
We propose to monitor the model's relevancy signal and manipulate it such that the model is focused on the foreground object.
This is done as a finetuning step, involving relatively few samples consisting of pairs of images and their associated foreground masks.
- Score: 91.61353418331244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It has been observed that visual classification models often rely mostly on
the image background, neglecting the foreground, which hurts their robustness
to distribution changes. To alleviate this shortcoming, we propose to monitor
the model's relevancy signal and manipulate it such that the model is focused
on the foreground object. This is done as a finetuning step, involving
relatively few samples consisting of pairs of images and their associated
foreground masks. Specifically, we encourage the model's relevancy map (i) to
assign lower relevance to background regions, (ii) to consider as much
information as possible from the foreground, and (iii) we encourage the
decisions to have high confidence. When applied to Vision Transformer (ViT)
models, a marked improvement in robustness to domain shifts is observed.
Moreover, the foreground masks can be obtained automatically, from a
self-supervised variant of the ViT model itself; therefore no additional
supervision is required.
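As an illustration of the three objectives described above, the sketch below combines them into a single finetuning loss. This is a minimal sketch, not the authors' exact formulation: it assumes a ViT for which per-patch relevance scores and patch-level foreground masks are available, and all tensor names and weighting coefficients are hypothetical.

```python
import torch

def relevance_finetuning_loss(logits, relevance, fg_mask,
                              lambda_bg=1.0, lambda_fg=1.0, lambda_conf=1.0):
    """Sketch of the three objectives from the abstract.

    logits:    (B, num_classes) classifier outputs.
    relevance: (B, N) per-patch relevance scores in [0, 1] (assumed to come
               from a ViT relevancy/explainability method).
    fg_mask:   (B, N) binary foreground mask downsampled to patch resolution.
    """
    bg_mask = 1.0 - fg_mask
    # (i) push relevance on background patches toward zero
    loss_bg = (relevance * bg_mask).sum(dim=1) / bg_mask.sum(dim=1).clamp(min=1)
    # (ii) encourage the relevance map to cover the foreground
    loss_fg = ((1.0 - relevance) * fg_mask).sum(dim=1) / fg_mask.sum(dim=1).clamp(min=1)
    # (iii) encourage confident predictions (low entropy over classes)
    probs = logits.softmax(dim=-1)
    loss_conf = -(probs * probs.clamp(min=1e-8).log()).sum(dim=-1)
    return (lambda_bg * loss_bg + lambda_fg * loss_fg + lambda_conf * loss_conf).mean()
```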
Related papers
- Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering [1.8786950286587742]
As model size increases, high-norm artifact anomalies appear in the patches of multi-head attention.
We propose Inference-Time Attention Engineering (ITAE) which manipulates attention function during inference.
ITAE shows improved clustering accuracy on multiple datasets by exhibiting more expressive features in latent space.
arXiv Detail & Related papers (2024-10-07T07:26:10Z)
- Attention-Guided Masked Autoencoders For Learning Image Representations [16.257915216763692]
Masked autoencoders (MAEs) have established themselves as a powerful method for unsupervised pre-training for computer vision tasks.
We propose to inform the reconstruction process through an attention-guided loss function.
Our evaluations show that our pre-trained models learn better latent representations than the vanilla MAE.
arXiv Detail & Related papers (2024-02-23T08:11:25Z)
- Fiducial Focus Augmentation for Facial Landmark Detection [4.433764381081446]
We propose a novel image augmentation technique to enhance the model's understanding of facial structures.
We employ a Siamese architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss.
Our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.
arXiv Detail & Related papers (2024-02-23T01:34:00Z)
- Vision Transformers Need Registers [26.63912173005165]
We identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks.
We propose a simple fix based on adding dedicated register tokens to the input sequence, and show that this solution fixes the problem entirely for both supervised and self-supervised models (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-09-28T16:45:46Z)
- Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm.
FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet-ReaL validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations [61.95114821573875]
We introduce Plug-In Inversion, which relies on a simple set of augmentations and does not require excessive hyperparameter tuning.
We illustrate the practicality of our approach by inverting Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs) trained on the ImageNet dataset.
arXiv Detail & Related papers (2022-01-31T02:12:45Z)
- FoveaTer: Foveated Transformer for Image Classification [8.207403859762044]
We propose the foveated Transformer (FoveaTer) model, which uses pooling regions and saccadic movements to perform object classification.
We construct an ensemble of our proposed model and an unfoveated model, achieving accuracy only 1.36% below the unfoveated model with 22% computational savings.
arXiv Detail & Related papers (2021-05-29T01:54:33Z)
- Transformer-Based Source-Free Domain Adaptation [134.67078085569017]
We study the task of source-free domain adaptation (SFDA), where the source data are not available during target adaptation.
We propose a generic and effective framework based on Transformer, named TransDA, for learning a generalized model for SFDA.
arXiv Detail & Related papers (2021-05-28T23:06:26Z)
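For the "Vision Transformers Need Registers" entry above, the following is a minimal sketch of the register-token idea: learnable extra tokens are appended to the patch sequence so they can take part in attention (and absorb artifact behavior), and are discarded at the output. The `backbone` attribute names (patch_embed, cls_token, pos_embed, blocks) follow a timm-style ViT and are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Sketch: prepend learnable register tokens to a timm-style ViT backbone."""

    def __init__(self, backbone, num_registers=4):
        super().__init__()
        self.backbone = backbone
        dim = backbone.cls_token.shape[-1]
        # extra learnable tokens that participate in attention but carry no output
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))

    def forward(self, x):
        b = x.shape[0]
        tokens = self.backbone.patch_embed(x)                  # (B, N, D)
        cls = self.backbone.cls_token.expand(b, -1, -1)        # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1) + self.backbone.pos_embed
        regs = self.registers.expand(b, -1, -1)                # (B, R, D)
        tokens = torch.cat([tokens, regs], dim=1)
        for blk in self.backbone.blocks:
            tokens = blk(tokens)
        # drop the register tokens; keep only cls + patch tokens
        return tokens[:, : tokens.shape[1] - regs.shape[1]]
```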
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.