Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations
- URL: http://arxiv.org/abs/2403.20312v1
- Date: Fri, 29 Mar 2024 17:33:42 GMT
- Title: Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations
- Authors: Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, Aparna Bharati
- Abstract summary: Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt.
We present CC-Neg, a dataset containing 228,246 images, true captions and their corresponding negated captions.
Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework has an improved understanding of negations.
- Score: 43.484570564890866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt and impairing visual semantic matching and reasoning. An important aspect of reasoning in logic and language is negation. This paper highlights the limitations of popular VLMs, such as CLIP, at understanding the implications of negations, i.e., the effect of the word "not" in a given prompt. To enable evaluation of VLMs on fluent prompts with negations, we present CC-Neg, a dataset containing 228,246 images, true captions, and their corresponding negated captions. Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework has an improved understanding of negations. This training paradigm improves CoN-CLIP's ability to encode semantics reliably, resulting in a 3.85% average gain in top-1 accuracy for zero-shot image classification across 8 datasets. Further, CoN-CLIP outperforms CLIP on challenging compositionality benchmarks such as SugarCREPE by 4.4%, showcasing emergent compositional understanding of objects, relations, and attributes in text. Overall, our work addresses a crucial limitation of VLMs by introducing a dataset and framework that strengthen semantic associations between images and text, demonstrating that large-scale foundation models can be improved at significantly reduced computational cost, promoting efficiency and accessibility.
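The abstract describes CoN-CLIP as CLIP trained with a modified contrastive loss that makes use of CC-Neg's negated captions. The exact objective is defined in the paper; the snippet below is only a minimal PyTorch sketch of one plausible reading, in which each image's negated caption is appended to the text pool as an additional hard negative in the image-to-text InfoNCE term. The function name, tensor shapes, and temperature value are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch, not the exact CoN-CLIP objective: a CLIP-style
# InfoNCE loss in which each image's negated caption from CC-Neg acts
# as an extra hard negative for that image.
import torch
import torch.nn.functional as F

def contrastive_loss_with_negations(image_emb, caption_emb, negated_emb, temperature=0.07):
    """image_emb, caption_emb, negated_emb: (B, D) embeddings of the images,
    their true captions, and the corresponding negated captions."""
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    negated_emb = F.normalize(negated_emb, dim=-1)

    # Text pool = true captions followed by negated captions: (2B, D).
    text_pool = torch.cat([caption_emb, negated_emb], dim=0)

    # Image-to-text logits over the enlarged pool: each image must rank its
    # true caption above every other caption and above its own negation.
    logits_i2t = image_emb @ text_pool.t() / temperature   # (B, 2B)
    # Text-to-image logits over true captions only, as in standard CLIP.
    logits_t2i = caption_emb @ image_emb.t() / temperature  # (B, B)

    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits_i2t, targets)
    loss_t2i = F.cross_entropy(logits_t2i, targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Under this sketch, each image embedding is pulled toward its true caption and pushed away from its negated caption, which is the behavior the abstract attributes to CoN-CLIP's improved handling of "not".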
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z)
- FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos [19.08882495584709]
We show that it is possible to enhance CLIP's fine-grained and syntactic abilities without compromising its semantic properties.
We adapt CLIP efficiently on a high-quality, comprehensive, and relatively small dataset.
We learn a powerful visual representation, dubbed Fine-Grained CLIP (FiGCLIP), that preserves semantic understanding while being detail-oriented.
arXiv Detail & Related papers (2024-01-15T13:27:34Z)
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2023-06-15T03:26:28Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Unifying Graph Contrastive Learning with Flexible Contextual Scopes [57.86762576319638]
We present a self-supervised learning method termed Unifying Graph Contrastive Learning with Flexible Contextual Scopes (UGCL for short).
Our algorithm builds flexible contextual representations with contextual scopes by controlling the power of an adjacency matrix.
Based on representations from both local and contextual scopes, UGCL optimises a very simple contrastive loss function for graph representation learning.
arXiv Detail & Related papers (2022-10-17T07:16:17Z)
- Label Anchored Contrastive Learning for Language Understanding [17.28721753405111]
We propose a novel label anchored contrastive learning approach (denoted as LaCon) for language understanding.
Our approach does not require any specialized network architecture or any extra data augmentation.
LaCon obtains up to 4.1% improvement on the popular datasets of GLUE and CLUE benchmarks.
arXiv Detail & Related papers (2022-04-26T15:33:01Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.