Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models
- URL: http://arxiv.org/abs/2404.07983v2
- Date: Thu, 10 Oct 2024 17:58:49 GMT
- Title: Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models
- Authors: Simon Schrodi, David T. Hoffmann, Max Argus, Volker Fischer, Thomas Brox
- Abstract summary: Contrastive vision-language models (VLMs) have gained popularity for their versatile applicability to various downstream tasks.
Despite their successes in some tasks, like zero-shot object recognition, they perform surprisingly poorly on other tasks, like attribute recognition.
Previous work has attributed these challenges to the modality gap, a separation of image and text in the shared representation space, and to a bias towards objects over other factors, such as attributes.
- Abstract: Contrastive vision-language models (VLMs), like CLIP, have gained popularity for their versatile applicability to various downstream tasks. Despite their successes in some tasks, like zero-shot object recognition, they perform surprisingly poorly on other tasks, like attribute recognition. Previous work has attributed these challenges to the modality gap, a separation of image and text in the shared representation space, and to a bias towards objects over other factors, such as attributes. In this analysis paper, we investigate both phenomena thoroughly. We evaluate off-the-shelf VLMs and find that, while the gap's influence on performance is typically overshadowed by other factors, there are indications that closing the gap indeed leads to improvements. Moreover, we find that, contrary to intuition, only a few embedding dimensions drive the gap and that the embedding spaces are organized differently. To allow for a clean study of object bias, we introduce a definition and a corresponding measure of it. Equipped with this tool, we find that object bias does not, per se, lead to worse performance on other concepts, such as attributes. But why do both phenomena, the modality gap and the object bias, emerge in the first place? To answer this fundamental question and uncover some of the inner workings of contrastive VLMs, we conducted experiments that allowed us to control the amount of shared information between the modalities. These experiments revealed that the driving factor behind both the modality gap and the object bias is an information imbalance between images and captions, and unveiled an intriguing connection between the modality gap and the entropy of the logits.
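The modality gap the abstract describes is commonly quantified as the distance between the centroids of the normalized image and text embeddings in the shared space. The sketch below (hypothetical helper names, assuming NumPy arrays of pre-computed embeddings; not the paper's exact protocol) illustrates that measure, plus a crude per-dimension check in the spirit of the finding that only a few embedding dimensions drive the gap:

```python
import numpy as np

def modality_gap(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Euclidean distance between the centroids of L2-normalized
    image and text embeddings -- a common measure of the modality gap."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

def top_gap_dimensions(image_embs: np.ndarray, text_embs: np.ndarray,
                       k: int = 5) -> np.ndarray:
    """Indices of the k embedding dimensions with the largest absolute
    centroid difference, i.e. the dimensions contributing most to the gap."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    diff = np.abs(img.mean(axis=0) - txt.mean(axis=0))
    return np.argsort(diff)[::-1][:k]
```

On synthetic embeddings where one coordinate of the text set is shifted, `top_gap_dimensions` recovers that coordinate, mimicking how a real gap concentrated in few dimensions would show up.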
Related papers
- Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching [47.05588106164043]
Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features.
Recent studies have explored subspace interventions as a way to manipulate model behavior and attribute the features behind it to given subspaces.
We demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability.
arXiv Detail & Related papers (2023-11-28T18:32:19Z)
- Identifying Linearly-Mixed Causal Representations from Multi-Node Interventions [14.586959818386765]
We provide the first identifiability result for causal representation learning that allows for multiple variables to be targeted by an intervention within one environment.
Our approach hinges on a general assumption on the coverage and diversity of interventions across environments.
In addition to and inspired by our theoretical contributions, we present a practical algorithm to learn causal representations from multi-node interventional data.
arXiv Detail & Related papers (2023-11-05T16:05:00Z)
- Joint Salient Object Detection and Camouflaged Object Detection via Uncertainty-aware Learning [47.253370009231645]
We introduce an uncertainty-aware learning pipeline to explore the contradictory information of salient object detection (SOD) and camouflaged object detection (COD).
Our solution leads to both state-of-the-art performance and informative uncertainty estimation.
arXiv Detail & Related papers (2023-07-10T15:49:37Z)
- Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning [98.78136504619539]
Causal Triplet is a causal representation learning benchmark featuring visually more complex scenes.
We show that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts.
arXiv Detail & Related papers (2023-01-12T17:43:38Z)
- Chairs Can be Stood on: Overcoming Object Bias in Human-Object Interaction Detection [22.3445174577181]
Human-Object Interaction (HOI) in images is an important step towards high-level visual comprehension.
We propose a novel plug-and-play Object-wise Debiasing Memory (ODM) method for re-balancing the distribution of interactions under detected objects.
Our method brings consistent and significant improvements over baselines, especially on rare interactions under each object.
arXiv Detail & Related papers (2022-07-06T01:55:28Z)
- Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization [73.89239820192894]
We argue that automated counterfactual generation should regard several aspects of the produced adversarial instances.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z)
- Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding [93.82542533426766]
We propose a Suspected Object Transformation mechanism (SOT) to encourage the target object selection among the suspected ones.
SOT can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders.
Extensive experiments demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2022-03-10T06:41:07Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning.
arXiv Detail & Related papers (2021-07-30T19:24:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.