Fairness-aware Vision Transformer via Debiased Self-Attention
- URL: http://arxiv.org/abs/2301.13803v3
- Date: Thu, 11 Jul 2024 02:11:49 GMT
- Title: Fairness-aware Vision Transformer via Debiased Self-Attention
- Authors: Yao Qiang, Chengyin Li, Prashant Khanduri, Dongxiao Zhu
- Abstract summary: Debiased Self-Attention (DSA) is a fairness-through-blindness approach that forces the Vision Transformer (ViT) to eliminate spurious features correlated with the sensitive label, thereby mitigating bias.
Our framework achieves stronger fairness guarantees than prior work on multiple prediction tasks without compromising target prediction performance.
- Score: 12.406960223371959
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) has recently gained significant attention in solving computer vision (CV) problems due to its capability of extracting informative features and modeling long-range dependencies through the attention mechanism. While recent works have explored the trustworthiness of ViT, including its robustness and explainability, the issue of fairness has not yet been adequately addressed. We establish that the existing fairness-aware algorithms designed for CNNs do not perform well on ViT, which highlights the need for our novel framework via Debiased Self-Attention (DSA). DSA is a fairness-through-blindness approach that forces the ViT to eliminate spurious features correlated with the sensitive label for bias mitigation while simultaneously retaining real features for target prediction. Notably, DSA leverages adversarial examples to locate and mask the spurious features in the input image patches, with an additional attention-weights alignment regularizer in the training objective to encourage learning real features for target prediction. Importantly, our DSA framework leads to improved fairness guarantees over prior work on multiple prediction tasks without compromising target prediction performance. Code is available at https://github.com/qiangyao1988/DSA.
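The abstract describes a training objective with two parts: a target-prediction loss and an attention-weights alignment regularizer computed between the original input and its spurious-patch-masked counterpart. The snippet below is a minimal PyTorch sketch of that combination; the mean-squared-error distance and the weight `lam` are illustrative assumptions rather than details confirmed by the abstract, so the linked repository remains the authoritative implementation.

```python
import torch
import torch.nn.functional as F


def attention_alignment_loss(attn_clean: torch.Tensor,
                             attn_masked: torch.Tensor) -> torch.Tensor:
    """Penalize divergence between attention weights computed on the
    original image and on its copy with spurious patches masked out.
    Inputs are (batch, heads, tokens, tokens) attention maps from the
    same ViT layer; MSE is an assumed choice of distance."""
    return F.mse_loss(attn_masked, attn_clean)


def dsa_style_objective(logits, targets, attn_clean, attn_masked, lam=0.1):
    """Target-prediction loss plus the alignment regularizer.
    `lam` is a hypothetical trade-off weight, not the paper's value."""
    task_loss = F.cross_entropy(logits, targets)
    return task_loss + lam * attention_alignment_loss(attn_clean, attn_masked)
```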
Related papers
- FairViT: Fair Vision Transformer via Adaptive Masking [12.623905443515802]
Vision Transformer (ViT) has achieved excellent performance and demonstrated its promising potential in various computer vision tasks.
However, most ViT-based works do not take fairness into account, and it is unclear whether CNN-oriented debiasing algorithms can be applied directly to ViT.
We propose FairViT, a novel, accurate, and fair ViT framework.
arXiv Detail & Related papers (2024-07-20T08:10:37Z)
- Uncertainty-boosted Robust Video Activity Anticipation [72.14155465769201]
Video activity anticipation aims to predict what will happen in the future, embracing a broad application prospect ranging from robot vision to autonomous driving.
Despite recent progress, the data uncertainty issue, reflected in the content evolution process and the dynamic correlation among event labels, has been largely ignored.
We propose an uncertainty-boosted robust video activity anticipation framework, which generates uncertainty values to indicate the credibility of the anticipation results.
arXiv Detail & Related papers (2024-04-29T12:31:38Z)
- Interpretability-Aware Vision Transformer [13.310757078491916]
Vision Transformers (ViTs) have become prominent models for solving various vision tasks.
We introduce a novel training procedure that inherently enhances model interpretability.
IA-ViT is composed of a feature extractor, a predictor, and an interpreter, which are trained jointly with an interpretability-aware training objective.
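Since the entry spells out a three-component architecture trained under a joint objective, a toy sketch may clarify the shape of such a setup. Everything here (the linear heads, the KL matching term, the weight `beta`) is a hypothetical illustration, not the paper's implementation.

```python
import torch.nn as nn
import torch.nn.functional as F


class IAViTSketch(nn.Module):
    """Toy three-part layout: a shared feature extractor feeding a
    predictor head and an interpreter head."""

    def __init__(self, extractor: nn.Module, dim: int, num_classes: int):
        super().__init__()
        self.extractor = extractor                      # e.g. a ViT backbone
        self.predictor = nn.Linear(dim, num_classes)    # task head
        self.interpreter = nn.Linear(dim, num_classes)  # explanation head

    def forward(self, x):
        feats = self.extractor(x)                       # (batch, dim) pooled features
        return self.predictor(feats), self.interpreter(feats)


def interpretability_aware_loss(pred_logits, interp_logits, targets, beta=0.5):
    """Task loss plus a term pulling the interpreter's distribution
    toward the predictor's; `beta` is a hypothetical weight."""
    task = F.cross_entropy(pred_logits, targets)
    # Detach the predictor's distribution so the match term only
    # trains the interpreter, a common design choice in such setups.
    match = F.kl_div(F.log_softmax(interp_logits, dim=-1),
                     F.softmax(pred_logits, dim=-1).detach(),
                     reduction="batchmean")
    return task + beta * match
```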
arXiv Detail & Related papers (2023-09-14T21:50:49Z)
- ARBEx: Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning [1.9844265130823329]
ARBEx is a novel attentive feature extraction framework driven by a Vision Transformer.
We employ learnable anchor points in the embedding space, together with label distributions and a multi-head self-attention mechanism, to optimize performance against weak predictions.
Our strategy outperforms current state-of-the-art methodologies, according to extensive experiments conducted in a variety of contexts.
arXiv Detail & Related papers (2023-05-02T15:10:01Z)
- VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking [10.345616883018296]
We propose a post-hoc interpretability method called VISION DIFFMASK.
It uses the activations of the model's hidden layers to predict the relevant parts of the input that contribute to its final predictions.
Our approach uses a gating mechanism to identify the minimal subset of the original input that preserves the predicted distribution over classes.
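As a rough illustration of the gating idea described above, the sketch below scores patches from a hidden layer's activations and trains the gates to keep the class distribution intact while keeping as few patches as possible. The plain sigmoid gate and the loss weights are simplifying assumptions; the actual method uses a differentiable stochastic masking distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchGate(nn.Module):
    """Maps a hidden layer's patch activations to keep-probabilities.
    A plain sigmoid stands in for the differentiable masking
    distribution used by the method; this is a simplification."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states):  # (batch, patches, hidden_dim)
        return torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)


def gate_training_loss(logits_full, logits_masked, gates, sparsity=0.05):
    """Keep the masked input's prediction close to the original
    distribution while penalizing how much of the input is kept;
    `sparsity` is a hypothetical trade-off weight."""
    faithfulness = F.kl_div(F.log_softmax(logits_masked, dim=-1),
                            F.softmax(logits_full, dim=-1),
                            reduction="batchmean")
    return faithfulness + sparsity * gates.mean()
```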
arXiv Detail & Related papers (2023-04-13T10:49:26Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose the Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- Function Composition in Trustworthy Machine Learning: Implementation Choices, Insights, and Questions [28.643482049799477]
This paper focuses on compositions of functions arising from the different 'pillars' of trustworthiness.
We report initial empirical results and new insights on 7 real-world compositions involving the trustworthiness dimensions of fairness and explainability.
We also report progress and implementation choices for a composer tool that encourages combining functionalities from multiple pillars.
arXiv Detail & Related papers (2023-02-17T23:49:16Z)
- Understanding The Robustness in Vision Transformers [140.1090560977082]
Self-attention may promote robustness through improved mid-level representations.
We propose a family of fully attentional networks (FANs) that strengthen this capability.
Our model achieves state-of-the-art results of 87.1% accuracy on ImageNet-1k and 35.8% mCE on ImageNet-C with 76.8M parameters.
arXiv Detail & Related papers (2022-04-26T17:16:32Z)
- On Exploring Pose Estimation as an Auxiliary Learning Task for Visible-Infrared Person Re-identification [66.58450185833479]
In this paper, we exploit pose estimation as an auxiliary learning task to assist the VI-ReID task in an end-to-end framework.
By jointly training the two tasks in a mutually beneficial manner, our model learns higher-quality modality-shared and ID-related features.
Experimental results on two benchmark VI-ReID datasets show that the proposed method consistently improves state-of-the-art methods by significant margins.
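The joint-training recipe described here is the standard auxiliary-task pattern: one shared backbone, two heads, and a weighted sum of losses. Below is a generic sketch under those assumptions; the specific loss forms and the weight `alpha` are illustrative, not taken from the paper.

```python
import torch.nn.functional as F


def joint_reid_pose_loss(reid_logits, person_ids,
                         keypoints_pred, keypoints_gt, alpha=0.5):
    """Sum of the main-task and auxiliary-task losses computed over a
    shared backbone's features; `alpha` is a hypothetical weight."""
    reid_loss = F.cross_entropy(reid_logits, person_ids)  # identity classification
    pose_loss = F.mse_loss(keypoints_pred, keypoints_gt)  # keypoint regression
    return reid_loss + alpha * pose_loss
```

Because both heads backpropagate through the shared backbone, the auxiliary pose gradients shape the features used for re-identification, which is the mutual benefit the entry refers to.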
arXiv Detail & Related papers (2022-01-11T09:44:00Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study these properties via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.