Learning Diverse Features in Vision Transformers for Improved
Generalization
- URL: http://arxiv.org/abs/2308.16274v1
- Date: Wed, 30 Aug 2023 19:04:34 GMT
- Title: Learning Diverse Features in Vision Transformers for Improved
Generalization
- Authors: Armand Mihai Nicolicioiu, Andrei Liviu Nicolicioiu, Bogdan Alexe,
Damien Teney
- Abstract summary: We show that vision transformers (ViTs) tend to extract robust and spurious features with distinct attention heads.
As a result of this modularity, their performance under distribution shifts can be significantly improved at test time.
We propose a method to further enhance the diversity and complementarity of the learned features by encouraging orthogonality of the attention heads' input gradients.
- Score: 15.905065768434403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning models often rely only on a small set of features even when
there is a rich set of predictive signals in the training data. This makes
models brittle and sensitive to distribution shifts. In this work, we first
examine vision transformers (ViTs) and find that they tend to extract robust
and spurious features with distinct attention heads. As a result of this
modularity, their performance under distribution shifts can be significantly
improved at test time by pruning heads corresponding to spurious features,
which we demonstrate using an "oracle selection" on validation data. Second, we
propose a method to further enhance the diversity and complementarity of the
learned features by encouraging orthogonality of the attention heads' input
gradients. We observe improved out-of-distribution performance on diagnostic
benchmarks (MNIST-CIFAR, Waterbirds) as a consequence of the enhanced diversity
of features and the pruning of undesirable heads.
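The gradient-orthogonality idea from the abstract can be illustrated with a simple regularizer: take each attention head's gradient with respect to the input and penalize pairwise alignment between heads. A minimal NumPy sketch (the function name and the squared-cosine penalty form are illustrative assumptions, not the authors' exact implementation):

```python
import numpy as np

def orthogonality_penalty(head_grads):
    """Penalize alignment between per-head input gradients.

    head_grads: array of shape (n_heads, d), where row h is the gradient
    of head h's output with respect to the (flattened) input.
    Returns the sum of squared pairwise cosine similarities, which is
    zero exactly when all head gradients are mutually orthogonal.
    """
    # Normalize each gradient to unit length (guard against zero vectors).
    norms = np.linalg.norm(head_grads, axis=1, keepdims=True)
    g = head_grads / np.clip(norms, 1e-12, None)
    # Gram matrix of cosine similarities between heads.
    cos = g @ g.T
    # Sum the squared off-diagonal entries, counting each pair once.
    n = cos.shape[0]
    off = cos[~np.eye(n, dtype=bool)]
    return float(np.sum(off ** 2) / 2.0)
```

Minimizing this penalty alongside the task loss pushes heads toward distinct input directions, which is the mechanism the abstract credits for the improved feature diversity.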
Related papers
- Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation [24.294049653744185]
In transformer-based image classification, the class token at the first transformer encoder layer can be learned to capture the domain-specific characteristics of target samples during test-time adaptation.
We propose a bi-level learning approach to capture the long-term variations of domain-specific characteristics while accommodating local variations of instance-specific characteristics.
Our proposed bi-level visual conditioning token learning method is able to achieve significantly improved test-time adaptation performance by up to 1.9%.
arXiv Detail & Related papers (2024-06-27T17:16:23Z)
- Unsupervised Generative Feature Transformation via Graph Contrastive Pre-training and Multi-objective Fine-tuning [28.673952870674146]
We develop a measurement-pretrain-finetune paradigm for Unsupervised Feature Transformation Learning.
For unsupervised feature set utility measurement, we propose a feature value consistency preservation perspective.
For generative transformation finetuning, we regard a feature set as a feature cross sequence and feature transformation as sequential generation.
arXiv Detail & Related papers (2024-05-27T06:50:00Z)
- Mitigating Biases with Diverse Ensembles and Diffusion Models [99.6100669122048]
We propose an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs)
We show that DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features.
We show that DPM-guided diversification is sufficient to remove dependence on primary shortcut cues, without a need for additional supervised signals.
arXiv Detail & Related papers (2023-11-23T15:47:33Z)
- MT-SLVR: Multi-Task Self-Supervised Learning for Transformation In(Variant) Representations [2.94944680995069]
We propose a multi-task self-supervised framework (MT-SLVR) that learns both variant and invariant features in a parameter-efficient manner.
We evaluate our approach on few-shot classification tasks drawn from a variety of audio domains and demonstrate improved classification performance.
arXiv Detail & Related papers (2023-05-29T09:10:50Z)
- Is Self-Supervised Learning More Robust Than Supervised Learning? [29.129681691651637]
Self-supervised contrastive learning is a powerful tool to learn visual representation without labels.
We conduct a series of robustness tests to quantify the behavioral differences between contrastive learning and supervised learning.
Under pre-training corruptions, we find contrastive learning vulnerable to patch shuffling and pixel intensity change, yet less sensitive to dataset-level distribution change.
arXiv Detail & Related papers (2022-06-10T17:58:00Z)
- Agree to Disagree: Diversity through Disagreement for Better Transferability [54.308327969778155]
We propose D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data while encouraging them to disagree on out-of-distribution data.
We show how D-BAT naturally emerges from the notion of generalized discrepancy.
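The disagreement side of D-BAT can be sketched for a pair of binary classifiers: on unlabeled out-of-distribution points, the term below is small when the two models confidently disagree and large when they agree. This follows the binary form reported in the D-BAT paper; the function name and epsilon guard are illustrative assumptions:

```python
import math

def dbat_disagreement(p1, p2, eps=1e-12):
    """D-BAT-style disagreement term for two binary classifiers.

    p1, p2: each model's predicted probability of the positive class
    on an unlabeled out-of-distribution point. The term is minimized
    when the models confidently disagree (one near 0, the other near 1)
    and grows when both make the same confident prediction.
    """
    return -math.log(p1 * (1.0 - p2) + (1.0 - p1) * p2 + eps)
```

In training, this term is added (with a weight) to the ordinary task loss on in-distribution data, so the ensemble stays accurate while its members diversify.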
arXiv Detail & Related papers (2022-02-09T12:03:02Z)
- Revisiting Consistency Regularization for Semi-Supervised Learning [80.28461584135967]
We propose an improved consistency regularization framework by a simple yet effective technique, FeatDistLoss.
Experimental results show that our model defines a new state of the art for various datasets and settings.
arXiv Detail & Related papers (2021-12-10T20:46:13Z)
- Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
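The extrapolation step of the upper model can be illustrated with one round of mean-aggregation message passing over a feature-data bipartite graph. This is a deliberately simplified, hypothetical sketch; the paper's graph neural network uses learned transformations rather than a plain mean:

```python
import numpy as np

def extrapolate_feature_embeddings(data_emb, incidence):
    """One mean-aggregation pass on a feature-data bipartite graph.

    data_emb:  (n_samples, d) embeddings of observed data points.
    incidence: (n_features, n_samples) binary matrix; entry (f, i) is 1
               if feature f is present (non-zero) in sample i.
    Returns (n_features, d) feature embeddings, including for features
    unseen at training time, as the mean of their neighbors' embeddings.
    """
    counts = incidence.sum(axis=1, keepdims=True)
    # Clip avoids division by zero for features with no observed samples.
    return (incidence @ data_emb) / np.clip(counts, 1, None)
```

A new feature thus inherits an embedding from the data points it co-occurs with, which is what lets the backbone accept features that did not exist during training.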
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.