Rectify ViT Shortcut Learning by Visual Saliency
- URL: http://arxiv.org/abs/2206.08567v1
- Date: Fri, 17 Jun 2022 05:54:07 GMT
- Title: Rectify ViT Shortcut Learning by Visual Saliency
- Authors: Chong Ma, Lin Zhao, Yuzhong Chen, David Weizhong Liu, Xi Jiang, Tuo
Zhang, Xintao Hu, Dinggang Shen, Dajiang Zhu, Tianming Liu
- Abstract summary: Shortcut learning is common but harmful to deep learning models.
In this work, we propose a novel and effective saliency-guided vision transformer (SGT) model to rectify shortcut learning.
- Score: 40.55418820114868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Shortcut learning is common but harmful to deep learning models, leading to
degenerated feature representations and consequently jeopardizing the model's
generalizability and interpretability. However, shortcut learning in the widely
used Vision Transformer (ViT) framework remains largely unexplored. Meanwhile, introducing
domain-specific knowledge is a major approach to rectifying shortcuts,
which are predominantly driven by background-related factors. For example, in the
medical imaging field, eye-gaze data from radiologists is an effective form of human
visual prior knowledge with great potential to guide deep learning
models toward meaningful foreground regions of interest. However,
obtaining eye-gaze data is time-consuming, labor-intensive, and sometimes
impractical.
not practical. In this work, we propose a novel and effective saliency-guided
vision transformer (SGT) model to rectify shortcut learning in ViT in the
absence of eye-gaze data. Specifically, a computational visual saliency model
is adopted to predict saliency maps for input image samples. Then, the saliency
maps are used to distill the most informative image patches. In the proposed
SGT, self-attention among image patches focuses only on the distilled
informative ones. Since this distillation operation may discard global
information, we further introduce, in the last encoder layer, a residual
connection that captures self-attention across all image patches. The
experimental results on four independent public datasets show that our SGT
framework can effectively learn and leverage human prior knowledge without
eye-gaze data and achieves much better performance than baselines. Meanwhile, it
successfully rectifies harmful shortcut learning and significantly improves
the interpretability of the ViT model, demonstrating the promise of
transferring visual saliency derived from human prior knowledge to rectify
shortcut learning.
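To make the described pipeline concrete, below is a minimal PyTorch sketch of the idea: per-patch saliency scores select the most informative patches, all but the last encoder layer run self-attention only over those distilled tokens, and the last layer adds a residual branch that attends across all patch tokens. This is an illustration under assumptions: the class name SaliencyGuidedViT, the keep_ratio parameter, and the precomputed per-patch saliency input are placeholders, not the authors' released implementation.

```python
# Minimal sketch of saliency-guided patch distillation for a ViT.
# Assumes per-patch saliency scores are precomputed by an external saliency model.
import torch
import torch.nn as nn


class SaliencyGuidedViT(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12, num_patches=196,
                 num_classes=2, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)       # flattened patch -> token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(depth)]
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches, saliency):
        # patches:  (B, N, 16*16*3) flattened image patches
        # saliency: (B, N) mean saliency inside each patch, from a saliency model
        B, N, _ = patches.shape
        tokens = self.patch_embed(patches) + self.pos_embed[:, 1:, :]
        cls = self.cls_token.expand(B, -1, -1) + self.pos_embed[:, :1, :]

        # Distillation step: keep only the most salient patches.
        k = max(1, int(self.keep_ratio * N))
        keep_idx = saliency.topk(k, dim=1).indices            # (B, k)
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        kept = torch.gather(tokens, 1, gather_idx)

        # All but the last layer attend only among the distilled patches.
        x = torch.cat([cls, kept], dim=1)
        for blk in self.blocks[:-1]:
            x = blk(x)

        # Last layer: a residual branch attends across *all* patch tokens
        # (here, their initial embeddings) to reintroduce global context.
        full_out = self.blocks[-1](torch.cat([x[:, :1, :], tokens], dim=1))
        cls_full = full_out[:, :1, :]
        kept_full = torch.gather(full_out[:, 1:, :], 1, gather_idx)
        x = x + torch.cat([cls_full, kept_full], dim=1)

        return self.head(x[:, 0])


# Toy usage: 196 patches of a 224x224 image with random per-patch saliency.
model = SaliencyGuidedViT()
logits = model(torch.randn(2, 196, 16 * 16 * 3), torch.rand(2, 196))
print(logits.shape)  # torch.Size([2, 2])
```

The sketch feeds the last layer the initial patch embeddings for the full-attention residual; this is one simple way to realize the "residual connection across all image patches" described in the abstract, not necessarily the exact design in the paper.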
Related papers
- Exploring the Evolution of Hidden Activations with Live-Update Visualization [12.377279207342735]
We introduce SentryCam, an automated, real-time visualization tool that reveals the progression of hidden representations during training.
Our results show that this visualization offers a more comprehensive view of the learning dynamics compared to basic metrics.
SentryCam can facilitate detailed analyses, such as task transfer and catastrophic forgetting, in a continual learning setting.
arXiv Detail & Related papers (2024-05-24T01:23:20Z) - Heuristic Vision Pre-Training with Self-Supervised and Supervised
Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z) - Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph
Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z) - SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for
Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z) - EfficientTrain: Exploring Generalized Curriculum Learning for Training
Visual Backbones [80.662250618795]
This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers).
As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models by >1.5x on ImageNet-1K/22K without sacrificing accuracy.
arXiv Detail & Related papers (2022-11-17T17:38:55Z) - Eye-gaze-guided Vision Transformer for Rectifying Shortcut Learning [42.674679049746175]
We propose to infuse human experts' intelligence and domain knowledge into the training of deep neural networks.
We propose a novel eye-gaze-guided vision transformer (EG-ViT) for diagnosis with limited medical image data.
arXiv Detail & Related papers (2022-05-25T03:29:10Z) - Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other ViT-based few-shot learning frameworks and is the first to outperform CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z) - CutPaste: Self-Supervised Learning for Anomaly Detection and
Localization [59.719925639875036]
We propose a framework for building anomaly detectors using normal training data only.
We first learn self-supervised deep representations and then build a generative one-class classifier on learned representations.
Our empirical study on the MVTec anomaly detection dataset demonstrates that the proposed algorithm is general enough to detect various types of real-world defects.
arXiv Detail & Related papers (2021-04-08T19:04:55Z) - Imitation Learning with Human Eye Gaze via Multi-Objective Prediction [3.5779268406205618]
We propose Gaze Regularized Imitation Learning (GRIL), a novel context-aware imitation learning architecture.
GRIL learns concurrently from both human demonstrations and eye gaze to solve tasks where visual attention provides important context.
We show that GRIL outperforms several state-of-the-art gaze-based imitation learning algorithms, simultaneously learns to predict human visual attention, and generalizes to scenarios not present in the training data.
arXiv Detail & Related papers (2021-02-25T17:13:13Z)