Learning Diverse Features in Vision Transformers for Improved
Generalization
- URL: http://arxiv.org/abs/2308.16274v1
- Date: Wed, 30 Aug 2023 19:04:34 GMT
- Title: Learning Diverse Features in Vision Transformers for Improved
Generalization
- Authors: Armand Mihai Nicolicioiu, Andrei Liviu Nicolicioiu, Bogdan Alexe,
Damien Teney
- Abstract summary: We show that vision transformers (ViTs) tend to extract robust and spurious features with distinct attention heads.
As a result of this modularity, their performance under distribution shifts can be significantly improved at test time.
We propose a method to further enhance the diversity and complementarity of the learned features by encouraging orthogonality of the attention heads' input gradients.
- Score: 15.905065768434403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning models often rely only on a small set of features even when
there is a rich set of predictive signals in the training data. This makes
models brittle and sensitive to distribution shifts. In this work, we first
examine vision transformers (ViTs) and find that they tend to extract robust
and spurious features with distinct attention heads. As a result of this
modularity, their performance under distribution shifts can be significantly
improved at test time by pruning heads corresponding to spurious features,
which we demonstrate using an "oracle selection" on validation data. Second, we
propose a method to further enhance the diversity and complementarity of the
learned features by encouraging orthogonality of the attention heads' input
gradients. We observe improved out-of-distribution performance on diagnostic
benchmarks (MNIST-CIFAR, Waterbirds) as a consequence of the enhanced diversity
of features and the pruning of undesirable heads.
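The gradient-orthogonality idea from the abstract can be illustrated with a simple regularizer: take each attention head's gradient with respect to the input and penalize pairwise alignment between heads. A minimal NumPy sketch (the function name and the squared-cosine penalty form are illustrative assumptions, not the authors' exact implementation):

```python
import numpy as np

def orthogonality_penalty(head_grads):
    """Penalize alignment between per-head input gradients.

    head_grads: array of shape (n_heads, d), where row h is the gradient
    of head h's output with respect to the (flattened) input.
    Returns the sum of squared pairwise cosine similarities, which is
    zero exactly when all head gradients are mutually orthogonal.
    """
    # Normalize each gradient to unit length (guard against zero vectors).
    norms = np.linalg.norm(head_grads, axis=1, keepdims=True)
    g = head_grads / np.clip(norms, 1e-12, None)
    # Gram matrix of cosine similarities between heads.
    cos = g @ g.T
    # Sum the squared off-diagonal entries, counting each pair once.
    n = cos.shape[0]
    off = cos[~np.eye(n, dtype=bool)]
    return float(np.sum(off ** 2) / 2.0)
```

Minimizing this penalty alongside the task loss pushes heads toward distinct input directions, which is the mechanism the abstract credits for the improved feature diversity.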
Related papers
- Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation [24.294049653744185]
In transformer-based image classification, the class token at the first transformer encoder layer can be learned to capture the domain-specific characteristics of target samples during test-time adaptation.
We propose a bi-level learning approach to capture the long-term variations of domain-specific characteristics while accommodating local variations of instance-specific characteristics.
Our proposed bi-level visual conditioning token learning method is able to achieve significantly improved test-time adaptation performance by up to 1.9%.
arXiv Detail & Related papers (2024-06-27T17:16:23Z)
- Unsupervised Generative Feature Transformation via Graph Contrastive Pre-training and Multi-objective Fine-tuning [28.673952870674146]
We develop a measurement-pretrain-finetune paradigm for Unsupervised Feature Transformation Learning.
For unsupervised feature set utility measurement, we propose a feature value consistency preservation perspective.
For generative transformation finetuning, we regard a feature set as a feature cross sequence and feature transformation as sequential generation.
arXiv Detail & Related papers (2024-05-27T06:50:00Z)
- Mitigating Biases with Diverse Ensembles and Diffusion Models [99.6100669122048]
We propose an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs)
We show that DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features.
We show that DPM-guided diversification is sufficient to remove dependence on primary shortcut cues, without a need for additional supervised signals.
arXiv Detail & Related papers (2023-11-23T15:47:33Z)
- MT-SLVR: Multi-Task Self-Supervised Learning for Transformation In(Variant) Representations [2.94944680995069]
We propose a multi-task self-supervised framework (MT-SLVR) that learns both variant and invariant features in a parameter-efficient manner.
We evaluate our approach on few-shot classification tasks drawn from a variety of audio domains and demonstrate improved classification performance.
arXiv Detail & Related papers (2023-05-29T09:10:50Z)
- Is Self-Supervised Learning More Robust Than Supervised Learning? [29.129681691651637]
Self-supervised contrastive learning is a powerful tool to learn visual representation without labels.
We conduct a series of robustness tests to quantify the behavioral differences between contrastive learning and supervised learning.
Under pre-training corruptions, we find contrastive learning vulnerable to patch shuffling and pixel intensity change, yet less sensitive to dataset-level distribution change.
arXiv Detail & Related papers (2022-06-10T17:58:00Z)
- Agree to Disagree: Diversity through Disagreement for Better Transferability [54.308327969778155]
We propose D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data while encouraging them to disagree on out-of-distribution data.
We show how D-BAT naturally emerges from the notion of generalized discrepancy.
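The disagreement side of D-BAT can be sketched for a pair of binary classifiers: on unlabeled out-of-distribution points, the term below is small when the two models confidently disagree and large when they agree. This follows the binary form reported in the D-BAT paper; the function name and epsilon guard are illustrative assumptions:

```python
import math

def dbat_disagreement(p1, p2, eps=1e-12):
    """D-BAT-style disagreement term for two binary classifiers.

    p1, p2: each model's predicted probability of the positive class
    on an unlabeled out-of-distribution point. The term is minimized
    when the models confidently disagree (one near 0, the other near 1)
    and grows when both make the same confident prediction.
    """
    return -math.log(p1 * (1.0 - p2) + (1.0 - p1) * p2 + eps)
```

In training, this term is added (with a weight) to the ordinary task loss on in-distribution data, so the ensemble stays accurate while its members diversify.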
arXiv Detail & Related papers (2022-02-09T12:03:02Z)
- Revisiting Consistency Regularization for Semi-Supervised Learning [80.28461584135967]
We propose an improved consistency regularization framework by a simple yet effective technique, FeatDistLoss.
Experimental results show that our model defines a new state of the art for various datasets and settings.
arXiv Detail & Related papers (2021-12-10T20:46:13Z)
- Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
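The extrapolation step of the upper model can be illustrated with one round of mean-aggregation message passing over a feature-data bipartite graph. This is a deliberately simplified, hypothetical sketch; the paper's graph neural network uses learned transformations rather than a plain mean:

```python
import numpy as np

def extrapolate_feature_embeddings(data_emb, incidence):
    """One mean-aggregation pass on a feature-data bipartite graph.

    data_emb:  (n_samples, d) embeddings of observed data points.
    incidence: (n_features, n_samples) binary matrix; entry (f, i) is 1
               if feature f is present (non-zero) in sample i.
    Returns (n_features, d) feature embeddings, including for features
    unseen at training time, as the mean of their neighbors' embeddings.
    """
    counts = incidence.sum(axis=1, keepdims=True)
    # Clip avoids division by zero for features with no observed samples.
    return (incidence @ data_emb) / np.clip(counts, 1, None)
```

A new feature thus inherits an embedding from the data points it co-occurs with, which is what lets the backbone accept features that did not exist during training.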
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.