Refining Skewed Perceptions in Vision-Language Models through Visual Representations
- URL: http://arxiv.org/abs/2405.14030v1
- Date: Wed, 22 May 2024 22:03:11 GMT
- Title: Refining Skewed Perceptions in Vision-Language Models through Visual Representations
- Authors: Haocheng Dai, Sarang Joshi
- Abstract summary: Large vision-language models (VLMs) have become foundational, demonstrating remarkable success across a variety of downstream tasks.
Despite their advantages, these models inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment.
This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications.
- Score: 0.033483662989441935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that CLIP text representations are often tainted by spurious correlations inherited from the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embeddings, is a more practical way to refine the skewed perceptions in VLMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our code will be available here.
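As a rough illustration of the linear-probe setup described in the abstract (a minimal sketch, not the authors' released code), the snippet below fits a logistic-regression probe on frozen CLIP visual embeddings. The ViT-B/32 backbone, the data loaders, and the regularization constant are illustrative assumptions.

```python
# Minimal sketch, assuming details: a linear probe on frozen CLIP *visual*
# embeddings, the setup the abstract argues is more practical than working
# with text embeddings.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed backbone

@torch.no_grad()
def embed_images(dataloader):
    """Encode a (preprocessed image, label) loader with the frozen CLIP visual tower."""
    feats, labels = [], []
    for images, targets in dataloader:
        f = model.encode_image(images.to(device))
        f = f / f.norm(dim=-1, keepdim=True)  # L2-normalize, as is standard for CLIP probes
        feats.append(f.cpu())
        labels.append(targets)
    return torch.cat(feats).float().numpy(), torch.cat(labels).numpy()

# train_loader / test_loader are assumed to yield batches preprocessed with `preprocess`.
X_train, y_train = embed_images(train_loader)
X_test, y_test = embed_images(test_loader)

# The linear probe itself: regularized logistic regression on the frozen features.
probe = LogisticRegression(max_iter=1000, C=0.316)
probe.fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))
```

On a downstream task with spurious context (e.g., correlated backgrounds), the intent of such a probe is to isolate the task-relevant directions of the visual embedding rather than the contextual ones.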
Related papers
- Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities.
CLIP has severe visual shortcomings: it can hardly distinguish orientation, quantity, color, or structure.
We present a post-training approach for CLIP models that largely overcomes these visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z) - Unified Lexical Representation for Interpretable Visual-Language Alignment [52.059812317944434]
We introduce LexVLA, a more interpretable VLA framework by learning a unified lexical representation for both modalities without complex design.
We demonstrate that two pre-trained uni-modal models can be well aligned by fine-tuning on a modest multi-modal dataset.
arXiv Detail & Related papers (2024-07-25T07:35:27Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Self-supervised Learning of Contextualized Local Visual Embeddings [0.0]
Contextualized Local Visual Embeddings (CLoVE) is a self-supervised convolution-based method that learns representations suited for dense prediction tasks.
We benchmark CLoVE's pre-trained representations on multiple datasets.
CLoVE reaches state-of-the-art performance for CNN-based architectures in 4 dense prediction downstream tasks.
arXiv Detail & Related papers (2023-10-01T00:13:06Z) - Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attributes and objects).
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z) - APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot Remote Sensing Image Generalization using CLIP [12.73827827842155]
We propose a novel image-conditioned prompt learning strategy called the Visual Attention Parameterized Prompt Learning Network (APPLeNet).
APPLeNet emphasizes the importance of multi-scale feature learning in RS scene classification and disentangles visual style and content primitives for domain generalization tasks.
Our results consistently outperform those reported in the relevant literature, and code is available at https://github.com/mainaksingha01/APPLeNet.
arXiv Detail & Related papers (2023-04-12T17:20:37Z) - Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models (a rough sketch of this projection idea appears after this list).
arXiv Detail & Related papers (2023-01-31T20:09:33Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Learning Deep Representations via Contrastive Learning for Instance Retrieval [11.736450745549792]
This paper makes the first attempt to tackle instance retrieval using instance-discrimination-based contrastive learning (CL).
In this work, we approach this problem by exploring the capability of deriving discriminative representations from pre-trained and fine-tuned CL models.
arXiv Detail & Related papers (2022-09-28T04:36:34Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
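For contrast with the visual-side probing sketched earlier, the "Debiasing Vision-Language Models via Biased Prompts" entry above works on the text side. Below is a rough sketch, under assumed details (prompt wording, ViT-B/32 backbone, and a plain orthogonal projection rather than the paper's calibrated matrix), of projecting biased directions out of CLIP text embeddings.

```python
# Hedged sketch of text-side debiasing: project the span of "biased" prompt
# embeddings out of the classifier prompts. Prompts, backbone, and the plain
# (uncalibrated) orthogonal projection are assumptions, not the paper's exact recipe.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # assumed backbone

@torch.no_grad()
def encode_texts(prompts):
    tokens = clip.tokenize(prompts).to(device)
    emb = model.encode_text(tokens).float()
    return emb / emb.norm(dim=-1, keepdim=True)

# Hypothetical spurious-attribute prompts whose directions we want to remove.
biased_prompts = ["a photo with a water background", "a photo with a land background"]
B = encode_texts(biased_prompts)  # (k, d) matrix of biased directions

# Orthogonal projection onto the complement of span(B): P = I - B^T (B B^T)^{-1} B.
k, d = B.shape
gram = B @ B.T + 1e-4 * torch.eye(k, device=device)  # small ridge for numerical stability
P = torch.eye(d, device=device) - B.T @ torch.linalg.solve(gram, B)

# Debiased zero-shot classifier weights: project, then renormalize.
class_prompts = ["a photo of a landbird", "a photo of a waterbird"]
W = encode_texts(class_prompts) @ P
W = W / W.norm(dim=-1, keepdim=True)
# Image-text similarity then uses W in place of the raw prompt embeddings.
```

The main paper above argues that, in practice, probing the visual embeddings (as in the earlier sketch) refines these skewed perceptions more reliably than such text-side corrections.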
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.