Contrastive vision-language learning with paraphrasing and negation
- URL: http://arxiv.org/abs/2511.16527v1
- Date: Thu, 20 Nov 2025 16:41:36 GMT
- Title: Contrastive vision-language learning with paraphrasing and negation
- Authors: Kwun Ho Ngan, Saman Sadeghi Afgeh, Joe Townsend, Artur d'Avila Garcez
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This paper proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to the training of CLIP-like models.
- Score: 0.018665975431697426
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to the training of CLIP-like models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to preserve CLIP's performance while considerably increasing the distances to negated captions. On the CC-Neg benchmark, using an original-over-negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the SugarCrepe++ benchmark, SemCLIP's performance is generally better than that of models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks, where SemCLIP pre-trained on SugarCrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.
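The abstract does not spell out the SemCLIP objective, but the described behaviour, pulling paraphrased captions towards the matching image embedding while pushing negated captions away, can be sketched as an extension of the standard CLIP loss. The PyTorch sketch below is a hypothetical illustration under that reading; the function name, the cosine-based paraphrase and negation terms, and the weighting coefficients are assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def semclip_style_loss(img_emb, orig_emb, para_emb, neg_emb,
                       temperature=0.07, w_para=1.0, w_neg=1.0):
    """Hypothetical CLIP-style loss over (image, original, paraphrase, negation)
    triples; all inputs are (batch, dim) embeddings from the two encoders."""
    img = F.normalize(img_emb, dim=-1)
    orig = F.normalize(orig_emb, dim=-1)
    para = F.normalize(para_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)

    # Standard symmetric InfoNCE between images and original captions.
    logits = img @ orig.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    clip_loss = 0.5 * (F.cross_entropy(logits, targets)
                       + F.cross_entropy(logits.t(), targets))

    # Pull paraphrased captions towards the matching image embedding.
    para_loss = (1.0 - (img * para).sum(dim=-1)).mean()

    # Push negated captions away from the matching image embedding
    # (penalise any positive cosine similarity).
    neg_loss = F.relu((img * neg).sum(dim=-1)).mean()

    return clip_loss + w_para * para_loss + w_neg * neg_loss
```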
Related papers
- SuperCLIP: CLIP with Simple Classification Supervision [88.86549733903314]
Contrastive Language-Image Pretraining achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. Recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text. We propose SuperCLIP, a framework that augments contrastive learning with classification-based supervision.
arXiv Detail & Related papers (2025-12-16T15:11:53Z)
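The SuperCLIP summary above only states that classification-based supervision is added on top of contrastive learning. A generic, hypothetical way to combine the two signals is sketched below; the classifier head, the use of multi-hot text-derived tags as targets, and the loss weighting are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastivePlusClassification(nn.Module):
    """Hypothetical head combining a CLIP-style contrastive loss with an
    auxiliary multi-label classification loss over text-derived tags."""

    def __init__(self, embed_dim: int, num_classes: int, w_cls: float = 0.5):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_classes)
        self.w_cls = w_cls

    def forward(self, img_emb, txt_emb, class_targets, temperature=0.07):
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)

        # Symmetric image-text contrastive term.
        logits = img @ txt.t() / temperature
        targets = torch.arange(img.size(0), device=img.device)
        contrastive = 0.5 * (F.cross_entropy(logits, targets)
                             + F.cross_entropy(logits.t(), targets))

        # Auxiliary classification term on the image embedding, e.g. against
        # multi-hot tags extracted from the caption (class_targets).
        cls_logits = self.classifier(img_emb)
        classification = F.binary_cross_entropy_with_logits(
            cls_logits, class_targets.float())

        return contrastive + self.w_cls * classification
```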
- TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP [27.33315565500182]
We introduce a training-time negation data generation pipeline for vision-language models (VLMs). We propose the first benchmark, Neg-TtoI, for evaluating text-to-image generation models on prompts containing negation. We show that our proposed method, TNG-CLIP, achieves SOTA performance on diverse negation benchmarks of image-to-text matching, text-to-image retrieval, and image generation.
arXiv Detail & Related papers (2025-05-24T00:02:48Z)
- Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP [57.33324843049638]
We introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving the generality. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately.
arXiv Detail & Related papers (2025-01-19T01:17:05Z)
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
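The TripletCLIP summary above does not give the training objective; one common way to fold synthetic hard-negative captions and images into a CLIP-style loss is to append them to the candidate sets, as in the hypothetical sketch below. The function and argument names, and the exact formulation, are assumptions rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(img_emb, txt_emb,
                                    neg_img_emb, neg_txt_emb,
                                    temperature=0.07):
    """Hypothetical CLIP-style loss where every (image, caption) pair also has a
    synthetic hard-negative caption and a hard-negative image (all (B, D))."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    neg_img = F.normalize(neg_img_emb, dim=-1)
    neg_txt = F.normalize(neg_txt_emb, dim=-1)

    targets = torch.arange(img.size(0), device=img.device)

    # Image -> text: candidates are all in-batch captions plus hard-negative captions.
    i2t_logits = img @ torch.cat([txt, neg_txt], dim=0).t() / temperature
    i2t_loss = F.cross_entropy(i2t_logits, targets)

    # Text -> image: candidates are all in-batch images plus hard-negative images.
    t2i_logits = txt @ torch.cat([img, neg_img], dim=0).t() / temperature
    t2i_loss = F.cross_entropy(t2i_logits, targets)

    return 0.5 * (i2t_loss + t2i_loss)
```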
- Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations [43.484570564890866]
Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt.
We present CC-Neg, a dataset containing 228,246 images, true captions and their corresponding negated captions.
Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework has an improved understanding of negations.
arXiv Detail & Related papers (2024-03-29T17:33:42Z)
- Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z)
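The training-time change described for LaCLIP, sampling among LLM-rewritten captions for each image, can be sketched with a small dataset wrapper, assuming the rewrites have already been generated offline. The wrapper and its field names below are hypothetical, not the paper's code.

```python
import random
from torch.utils.data import Dataset

class RewriteSamplingDataset(Dataset):
    """Hypothetical wrapper that, for each image, returns either the original
    caption or one of its pre-generated LLM rewrites, chosen at random."""

    def __init__(self, samples):
        # samples: list of dicts like
        # {"image": <path or PIL.Image>, "caption": str, "rewrites": [str, ...]}
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item = self.samples[idx]
        captions = [item["caption"]] + item.get("rewrites", [])
        return item["image"], random.choice(captions)
```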
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CyCLIP: Cyclic Contrastive Language-Image Pretraining [34.588147979731374]
Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP that achieve state-of-the-art performance for zero-shot classification and distributional robustness.
We demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions.
We propose CyCLIP, a framework for contrastive representation learning that explicitly optimizes for the learned representations to be geometrically consistent in the image and text space.
arXiv Detail & Related papers (2022-05-28T15:31:17Z)
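The geometric consistency idea in the CyCLIP summary above can be sketched as two squared-difference regularizers added to the usual contrastive loss: one asking cross-modal similarities to be symmetric across pairs, the other asking image-image and text-text similarities to agree. The sketch below is a simplified illustration of that idea, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def consistency_regularizers(img_emb, txt_emb):
    """Sketch of CyCLIP-style geometric consistency terms on (B, D) embeddings,
    to be added (with weights) to the usual contrastive loss."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)

    i2t = img @ txt.t()   # cross-modal similarities, (B, B)
    i2i = img @ img.t()   # image-image similarities
    t2t = txt @ txt.t()   # text-text similarities

    # Cross-modal consistency: sim(image_j, text_k) should match sim(image_k, text_j).
    cross_modal = ((i2t - i2t.t()) ** 2).mean()

    # In-modal consistency: image-image geometry should match text-text geometry.
    in_modal = ((i2i - t2t) ** 2).mean()

    return cross_modal, in_modal
```

Both terms would be weighted and added to the standard symmetric contrastive loss during training.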
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)