Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
- URL: http://arxiv.org/abs/2602.21035v1
- Date: Tue, 24 Feb 2026 15:55:39 GMT
- Title: Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
- Authors: Junhao Xiao, Zhiyu Wu, Hao Lin, Yi Chen, Yahui Liu, Xiaoran Zhao, Zixu Wang, Zejiang He,
- Abstract summary: Vision-Language Models (VLMs) like CLIP struggle to understand negation. Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. We propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions.
- Score: 23.10421006625293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.
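As a rough illustration of the modified similarity computation described in the abstract, the following minimal sketch (not the authors' code) assumes the Lens module yields an embedding of the negated semantics and the Frame module yields a per-caption repulsion strength; all names, shapes, and the exact form of the penalty are placeholders.
```python
# Minimal sketch of a repulsion-weighted similarity, assuming a Lens-like module
# produces a "negated-semantics" embedding per caption and a Frame-like module
# produces a scalar repulsion strength alpha per caption. Not the paper's code.
import torch
import torch.nn.functional as F

def repulsion_similarity(img_emb, txt_emb, neg_emb, alpha):
    """img_emb: (B, D) image embeddings; txt_emb: (B, D) caption embeddings;
    neg_emb: (B, D) embeddings of the negated content; alpha: (B,) strengths."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    neg_emb = F.normalize(neg_emb, dim=-1)
    pos_sim = img_emb @ txt_emb.t()   # standard CLIP image-text similarity (B, B)
    neg_sim = img_emb @ neg_emb.t()   # alignment with the negated content (B, B)
    # Penalize images that match the content the caption explicitly negates,
    # reducing false positives such as matching "no dog" with dog images.
    return pos_sim - alpha.unsqueeze(0) * neg_sim
```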
Related papers
- SuperCLIP: CLIP with Simple Classification Supervision [88.86549733903314]
Contrastive Language-Image Pretraining achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. Recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text. We propose SuperCLIP, a framework that augments contrastive learning with classification-based supervision.
arXiv Detail & Related papers (2025-12-16T15:11:53Z) - Contrastive vision-language learning with paraphrasing and negation [0.018665975431697426]
Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This paper proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to the training of CLIP-like models.
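A hedged sketch of what a contrastive loss over (original, paraphrased, negated) caption triples could look like; the paper's actual loss may differ, and the temperature `tau` and the treatment of negated captions as appended hard negatives are assumptions.
```python
# Sketch of a CLIP-style loss over (original, paraphrase, negation) caption
# triples. Embeddings are assumed L2-normalized; tau is an assumed temperature.
import torch
import torch.nn.functional as F

def triple_contrastive_loss(img, cap, para, neg, tau=0.07):
    """img, cap, para, neg: (B, D) normalized embeddings of images, original
    captions, LLM-generated paraphrases and negated captions."""
    logits_cap  = img @ cap.t()  / tau   # (B, B) image-to-caption logits
    logits_para = img @ para.t() / tau   # paraphrase treated as extra positive
    logits_neg  = img @ neg.t()  / tau   # negation treated as hard negative
    targets = torch.arange(img.size(0), device=img.device)
    # Each image must match its own caption (or paraphrase) against in-batch
    # captions plus the negated captions appended as never-correct candidates.
    loss_cap  = F.cross_entropy(torch.cat([logits_cap,  logits_neg], dim=1), targets)
    loss_para = F.cross_entropy(torch.cat([logits_para, logits_neg], dim=1), targets)
    return 0.5 * (loss_cap + loss_para)
```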
arXiv Detail & Related papers (2025-11-20T16:41:36Z) - Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens [12.946160260124378]
Contrastive Language-Image Pre-training delivers strong cross-modal generalization, yet it persistently fails at compositional reasoning over objects, attributes, and relations. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment.
arXiv Detail & Related papers (2025-10-30T09:41:21Z) - Enhancing CLIP Robustness via Cross-Modality Alignment [54.01929554563447]
We propose Cross-modality Alignment (COLA), an optimal transport-based framework for vision-language models. COLA restores global image-text alignment and local structural consistency in the feature space. COLA is training-free and compatible with existing fine-tuned models.
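COLA is described here only as an optimal-transport-based, training-free alignment; the sketch below shows generic Sinkhorn machinery for aligning image patch tokens with text tokens under that assumption, not the paper's actual algorithm, and the cosine cost, uniform marginals, and hyperparameters are all assumed.
```python
# Generic Sinkhorn optimal-transport alignment between image patch tokens and
# text tokens; illustrative only, not COLA's released implementation.
import torch
import torch.nn.functional as F

def sinkhorn_alignment(patch_tok, text_tok, eps=0.05, n_iter=50):
    """patch_tok: (M, D), text_tok: (N, D); returns a (M, N) transport plan."""
    cost = 1.0 - F.normalize(patch_tok, dim=-1) @ F.normalize(text_tok, dim=-1).t()
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    u = torch.ones(cost.size(0)) / cost.size(0)   # uniform marginal over patches
    v = torch.ones(cost.size(1)) / cost.size(1)   # uniform marginal over tokens
    a, b = u.clone(), v.clone()
    for _ in range(n_iter):                       # Sinkhorn iterations
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return torch.diag(a) @ K @ torch.diag(b)      # (M, N) transport plan
```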
arXiv Detail & Related papers (2025-10-28T03:47:44Z) - AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding [56.972490764212175]
Weakly supervised visual grounding aims to locate objects in images based on text descriptions. Existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions. We introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG.
arXiv Detail & Related papers (2025-08-05T08:16:35Z) - VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions [16.90061119174727]
We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Secondly, CLIP-IN incorporates long captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP.
arXiv Detail & Related papers (2025-08-04T11:57:10Z) - Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP [57.33324843049638]
We introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving generality. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately.
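An illustrative sketch of one step such a caption-generation pipeline might take; `llm_generate` is a hypothetical helper standing in for whatever LLM API is used, and the prompt wording and example output are invented for illustration, not taken from the paper.
```python
# Hypothetical negation-caption generation step. The prompt text and the
# llm_generate helper are placeholders, not NegationCLIP's actual pipeline.
NEGATION_PROMPT = (
    "Given the image caption below, name one object that is clearly absent from "
    "the scene and rewrite the caption so it also states that this object is not "
    "present.\nCaption: {caption}\nNegated caption:"
)

def make_negation_caption(caption: str, llm_generate) -> str:
    """Produce a negation-inclusive caption from an ordinary caption."""
    return llm_generate(NEGATION_PROMPT.format(caption=caption)).strip()

# Usage with any text-completion function plugged in as llm_generate; it might
# return something like "a dog playing in a park, with no people around".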
arXiv Detail & Related papers (2025-01-19T01:17:05Z) - TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
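A sketch, under stated assumptions rather than from the paper, of how synthetic negative captions and generated negative images could enter a CLIP-style batch as extra, never-correct candidates; the symmetric form and temperature are assumptions.
```python
# Illustrative loss with synthetic negatives on both modalities; not
# TripletCLIP's released code. All embeddings are assumed L2-normalized.
import torch
import torch.nn.functional as F

def synthetic_negatives_loss(img, txt, neg_img, neg_txt, tau=0.07):
    """img, txt: (B, D) matched pairs; neg_img: (B, D) generated negative
    images; neg_txt: (B, D) hard negative captions."""
    targets = torch.arange(img.size(0), device=img.device)
    # image -> text: in-batch captions plus hard negative captions as candidates
    i2t = torch.cat([img @ txt.t(), img @ neg_txt.t()], dim=1) / tau
    # text -> image: in-batch images plus generated negative images as candidates
    t2i = torch.cat([txt @ img.t(), txt @ neg_img.t()], dim=1) / tau
    return 0.5 * (F.cross_entropy(i2t, targets) + F.cross_entropy(t2i, targets))
```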
arXiv Detail & Related papers (2024-11-04T19:24:59Z) - Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations [43.484570564890866]
Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt.
We present CC-Neg, a dataset containing 228,246 images, true captions and their corresponding negated captions.
Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework has an improved understanding of negations.
arXiv Detail & Related papers (2024-03-29T17:33:42Z) - No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)