The Hard Positive Truth about Vision-Language Compositionality
- URL: http://arxiv.org/abs/2409.17958v1
- Date: Thu, 26 Sep 2024 15:36:10 GMT
- Title: The Hard Positive Truth about Vision-Language Compositionality
- Authors: Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, Ranjay Krishna
- Abstract summary: We investigate whether finetuned vision-language models remain invariant to hard positives.
We produce a training set of 1,775,259 image-text pairs with both hard negative and hard positive captions.
Our work suggests the need for future research to rigorously test and improve CLIP's understanding of semantic relationships between related "positive" concepts.
- Score: 64.8065854134201
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated -- because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP's performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a 1,775,259 image-text training set with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating a more robust improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP's understanding of semantic relationships between related "positive" concepts.
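To make the evaluation idea concrete, here is a minimal sketch assuming the Hugging Face `transformers` CLIP API; the image path and the three captions are illustrative stand-ins, not items from the paper's benchmark.

```python
# Minimal sketch: score one image against an original caption, a hard
# negative, and a hard positive. Image path and captions are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("dog_under_table.jpg")  # hypothetical test image
captions = [
    "a dog lying under a table",      # original caption
    "a table lying under a dog",      # hard negative: same words, wrong meaning
    "a dog resting beneath a table",  # hard positive: new words, same meaning
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image[0]  # one similarity score per caption

# A compositional model should rank both the original caption and the hard
# positive above the hard negative; checking only original-vs-negative, as
# existing benchmarks do, misses the second condition.
print({c: s.item() for c, s in zip(captions, scores)})
```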
Related papers
- TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models [53.91006249339802]
We propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks.
TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP.
We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets.
arXiv Detail & Related papers (2024-11-20T08:58:59Z)
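As a rough illustration of test-time prompt tuning in this spirit, the sketch below adapts learnable prompt context vectors on a single test input. The encoders and augmentation are passed in as assumed callables, and the entropy-over-views objective is a stand-in; TAPT's actual defensive objective and its bimodal (textual and visual) prompts are more involved.

```python
import torch
import torch.nn.functional as F

def test_time_tune(prompt_ctx, text_features_fn, image_encoder, augment,
                   image, steps=5, lr=1e-2, n_views=8):
    """Adapt learnable prompt context vectors on a single test input."""
    prompt_ctx = prompt_ctx.clone().requires_grad_(True)
    opt = torch.optim.AdamW([prompt_ctx], lr=lr)
    for _ in range(steps):
        views = torch.stack([augment(image) for _ in range(n_views)])
        img = F.normalize(image_encoder(views), dim=-1)           # (V, d)
        txt = F.normalize(text_features_fn(prompt_ctx), dim=-1)   # (C, d)
        probs = (100.0 * img @ txt.T).softmax(dim=-1)
        # stand-in objective: confident, view-consistent predictions
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    return prompt_ctx.detach()
```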
- Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations [43.484570564890866]
Existing vision-language models (VLMs) treat a text description as a single unit, confusing the individual concepts in a prompt.
We present CC-Neg, a dataset containing 228,246 images, true captions, and their corresponding negated captions.
Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework has an improved understanding of negations.
arXiv Detail & Related papers (2024-03-29T17:33:42Z)
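A minimal sketch of the general recipe above, assuming precomputed, L2-normalized CLIP embeddings: each image's negated caption is treated as an extra in-batch negative. The paper's exact loss modifications may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_with_negations(img_emb, cap_emb, neg_cap_emb, t=0.07):
    """img_emb, cap_emb, neg_cap_emb: (B, d) L2-normalized embeddings;
    neg_cap_emb[i] embeds a negated caption for image i."""
    all_text = torch.cat([cap_emb, neg_cap_emb], dim=0)  # (2B, d)
    logits = img_emb @ all_text.T / t                    # (B, 2B)
    targets = torch.arange(img_emb.size(0))              # true caption indices
    return F.cross_entropy(logits, targets)
```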
- When hard negative sampling meets supervised contrastive learning [17.173114048398947]
We introduce a new supervised contrastive learning objective, SCHaNe, which incorporates hard negative sampling during the fine-tuning phase.
SCHaNe outperforms the strong baseline BEiT-3 in Top-1 accuracy across various benchmarks.
Our proposed objective sets a new state of the art for base models on ImageNet-1k, achieving 86.14% accuracy.
arXiv Detail & Related papers (2023-08-28T20:30:10Z)
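The sketch below shows one way to fold hard negative sampling into a supervised contrastive objective, by up-weighting high-similarity negatives in the denominator; the weighting scheme is an assumption, not SCHaNe's exact formulation.

```python
import torch
import torch.nn.functional as F

def supcon_hard_negatives(feats, labels, temperature=0.1, beta=1.0):
    """feats: (B, d) L2-normalized embeddings; labels: (B,) class ids."""
    sim = feats @ feats.T / temperature                  # pairwise similarities
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos = (labels[:, None] == labels[None, :]) & ~eye    # same-class pairs
    neg = labels[:, None] != labels[None, :]             # different-class pairs
    # up-weight each negative by exp(beta * sim): harder negatives count more
    w = torch.ones_like(sim)
    w[neg] = (beta * sim[neg]).exp()
    exp_sim = sim.exp() * w * (~eye)                     # drop self-pairs
    log_prob = sim - exp_sim.sum(dim=1, keepdim=True).log()
    return -log_prob[pos].mean()                         # average over positive pairs
```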
- ScoreCL: Augmentation-Adaptive Contrastive Learning via Score-Matching Function [14.857965612960475]
Self-supervised contrastive learning (CL) has achieved state-of-the-art performance in representation learning.
We show the generality of our method, referred to as ScoreCL, by consistently improving various CL methods.
arXiv Detail & Related papers (2023-06-07T05:59:20Z)
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [139.56863124214905]
We find that the fine-tuning performance of CLIP has been substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset.
arXiv Detail & Related papers (2022-12-12T18:59:59Z)
- Revisiting Contrastive Learning through the Lens of Neighborhood Component Analysis: an Integrated Framework [70.84906094606072]
We present a new methodology for designing integrated contrastive losses that simultaneously achieve good accuracy and robustness on downstream tasks.
With the integrated framework, we achieve up to a 6% improvement in standard accuracy and a 17% improvement in adversarial accuracy.
arXiv Detail & Related papers (2021-12-08T18:54:11Z)
- Contrastive Attraction and Contrastive Repulsion for Representation Learning [131.72147978462348]
Contrastive learning (CL) methods learn data representations in a self-supervised manner, where the encoder contrasts each positive sample against multiple negative samples.
Recent CL methods have achieved promising results when pretrained on large-scale datasets, such as ImageNet.
We propose a doubly CL strategy that separately compares positive and negative samples within their own groups, and then proceeds with a contrast between positive and negative groups.
arXiv Detail & Related papers (2021-05-08T17:25:08Z)
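A loose sketch of that two-stage structure, with the within-group step crudely summarized by each group's re-normalized mean; the details here are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def doubly_contrastive(anchor, positives, negatives, t=0.1):
    """anchor: (d,); positives: (P, d); negatives: (N, d); all L2-normalized."""
    # within-group step, summarized here by each group's re-normalized mean
    pos_group = F.normalize(positives.mean(dim=0), dim=0)
    neg_group = F.normalize(negatives.mean(dim=0), dim=0)
    # between-group step: the anchor should match the positive group,
    # not the negative group
    logits = torch.stack([anchor @ pos_group, anchor @ neg_group]) / t
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```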
- Understanding and Achieving Efficient Robustness with Adversarial Contrastive Learning [34.97017489872795]
Our Adversarial Supervised Contrastive Learning (ASCL) approach outperforms the state-of-the-art defenses by 2.6% in terms of robust accuracy.
With the proposed selection strategy, ASCL gains a further 1.4% improvement while using only 42.8% of the positives and 6.3% of the negatives, compared with ASCL without a selection strategy.
arXiv Detail & Related papers (2021-01-25T11:57:52Z)
- NPCFace: Negative-Positive Collaborative Training for Large-scale Face Recognition [78.21084529159577]
We study how to make better use of hard samples to improve training.
Existing methods overlook the correlation between hard positives and hard negatives, as well as the relation between the margins in the positive and negative logits.
We propose a novel Negative-Positive Collaboration loss, named NPCFace, which emphasizes training on both hard negative and hard positive cases.
arXiv Detail & Related papers (2020-07-20T14:52:29Z)
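A simplified sketch of a collaborative margin loss in this spirit: hard positives (low cosine to their own class) receive an enlarged positive margin, and hard negatives (high cosine to a wrong class) receive an extra margin of their own. The threshold and margin values are illustrative assumptions, not NPCFace's published formulation.

```python
import torch
import torch.nn.functional as F

def npc_style_loss(cosine, labels, s=32.0, m_pos=0.2, m_neg=0.1, hard_thresh=0.4):
    """cosine: (B, C) cosine similarities to class centers; labels: (B,)."""
    idx = torch.arange(cosine.size(0))
    logits = cosine.clone()
    target = cosine[idx, labels]
    # hard positives (low cosine to own class) get an enlarged positive margin
    extra = torch.where(target < hard_thresh,
                        torch.full_like(target, m_pos),
                        torch.zeros_like(target))
    logits[idx, labels] = target - m_pos - extra
    # hard negatives (high cosine to a wrong class) get an extra margin too
    hard_neg = cosine > hard_thresh
    hard_neg[idx, labels] = False
    logits = torch.where(hard_neg, logits + m_neg, logits)
    return F.cross_entropy(s * logits, labels)
```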