RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
- URL: http://arxiv.org/abs/2411.04097v1
- Date: Wed, 06 Nov 2024 18:25:00 GMT
- Title: RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
- Authors: Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz
- Abstract summary: Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time.
We present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features.
- Score: 18.984025219051404
- License:
- Abstract: Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.
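The mitigation result in the abstract is reported as worst-group image classification accuracy. As a minimal sketch of how such a metric is commonly computed (not the authors' evaluation code), assume every test image has a predicted label, a true label, and a group identifier, for example whether the spurious feature is present. The function name `worst_group_accuracy`, the group names, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Return (worst accuracy, per-group accuracy dict).

    Assumption: `groups` assigns each example to one subgroup, e.g.
    "image contains the spurious cue" vs. "image does not".
    """
    if not preds:
        raise ValueError("need at least one example")
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    # The worst group is the one with the lowest accuracy.
    return min(per_group.values()), per_group


# Toy data (hypothetical): the model is perfect when the cue is present
# and makes one mistake out of three when it is absent.
preds = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 1, 1, 0]
groups = ["with_cue"] * 3 + ["without_cue"] * 3

worst, per_group = worst_group_accuracy(preds, labels, groups)
overall = sum(p == y for p, y in zip(preds, labels)) / len(preds)
```

In this toy case the overall accuracy is 5/6 while the worst-group accuracy is 2/3, which illustrates why a model that relies on a spurious cue can look acceptable on average yet fail on the subgroup where the cue is missing.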
Related papers
- DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation [18.77296551727931]
We propose DECIDER, a novel approach that leverages priors from large language models (LLMs) and vision-language models (VLMs) to detect failures in image models.
DECIDER consistently achieves state-of-the-art failure detection performance, significantly outperforming baselines in terms of the overall Matthews correlation coefficient.
arXiv Detail & Related papers (2024-08-01T07:08:11Z)
- Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training [79.27663870280038]
We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source vision-language models to respond to visual prompts.
When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench.
We also show CRG's applicability to spatial reasoning, with a 10% improvement on What'sUp.
arXiv Detail & Related papers (2024-03-04T18:55:30Z)
- ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data [8.905439446173503]
Vision-language models (VLMs) are generally trained on datasets consisting of image-caption pairs obtained from the web.
Real-world multimodal datasets, such as healthcare data, are significantly more complex.
ViLLA is trained to capture fine-grained region-attribute relationships from complex datasets.
arXiv Detail & Related papers (2023-08-22T05:03:09Z)
- Debiasing Counterfactuals In the Presence of Spurious Correlations [0.98342301244574]
We introduce the first end-to-end training framework that integrates both (i) popular debiasing classifiers and (ii) counterfactual image generation.
We demonstrate that the debiasing method (i) learns generalizable markers across the population and (ii) successfully ignores spurious correlations and focuses on the underlying disease pathology.
arXiv Detail & Related papers (2023-08-21T19:01:45Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
- DeVLBert: Learning Deconfounded Visio-Linguistic Representations [111.93480424791613]
We investigate the problem of out-of-domain visio-linguistic pretraining.
Existing methods for this problem are purely likelihood-based.
We propose a Deconfounded Visio-Linguistic Bert framework, abbreviated as DeVLBert, to perform intervention-based learning.
arXiv Detail & Related papers (2020-08-16T11:09:22Z)
- Cross-Domain Facial Expression Recognition: A Unified Evaluation Benchmark and Adversarial Graph Learning [85.6386289476598]
We develop a novel adversarial graph representation adaptation (AGRA) framework for cross-domain holistic-local feature co-adaptation.
We conduct extensive and fair evaluations on several popular benchmarks and show that the proposed AGRA framework outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2020-08-03T15:00:31Z)
- Out-of-distribution Generalization via Partial Feature Decorrelation [72.96261704851683]
We present a novel Partial Feature Decorrelation Learning (PFDL) algorithm, which jointly optimizes a feature decomposition network and the target image classification model.
The experiments on real-world datasets demonstrate that our method can improve the backbone model's accuracy on OOD image classification datasets.
arXiv Detail & Related papers (2020-07-30T05:48:48Z)
- High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification [84.43394420267794]
We propose a novel framework by learning high-order relation and topology information for discriminative features and robust alignment.
Our framework significantly outperforms the state of the art by 6.5% mAP on the Occluded-Duke dataset.
arXiv Detail & Related papers (2020-03-18T12:18:35Z)
- AVR: Attention based Salient Visual Relationship Detection [5.844015313757266]
Visual relationship detection aims to locate objects in images and recognize the relationships between objects.
Traditional methods treat all observed relationships in an image equally, which causes a relatively poor performance in the detection tasks on complex images with abundant visual objects and various relationships.
We propose an attention based model, namely AVR, to achieve salient visual relationships based on both local and global context of the relationships.
arXiv Detail & Related papers (2020-03-16T04:12:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.