TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP
- URL: http://arxiv.org/abs/2505.18434v1
- Date: Sat, 24 May 2025 00:02:48 GMT
- Title: TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP
- Authors: Yuliang Cai, Jesse Thomason, Mohammad Rostami
- Abstract summary: We introduce a training-time negation data generation pipeline for vision-language models (VLMs). We propose the first benchmark, Neg-TtoI, for evaluating text-to-image generation models on prompts containing negation. We show that our proposed method, TNG-CLIP, achieves SOTA performance on diverse negation benchmarks of image-to-text matching, text-to-image retrieval, and image generation.
- Score: 27.33315565500182
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs), such as CLIP, have demonstrated strong performance across a range of downstream tasks. However, CLIP is still limited in negation understanding: the ability to recognize the absence or exclusion of a concept. Existing methods address the problem by using a large language model (LLM) to generate large-scale data of image captions containing negation for further fine-tuning CLIP. However, these methods are both time- and compute-intensive, and their evaluations are typically restricted to image-text matching tasks. To expand the horizon, we (1) introduce a training-time negation data generation pipeline such that negation captions are generated during the training stage, which adds only 2.5% extra training time, and (2) propose the first benchmark, Neg-TtoI, for evaluating text-to-image generation models on prompts containing negation, assessing the model's ability to produce semantically accurate images. We show that our proposed method, TNG-CLIP, achieves SOTA performance on diverse negation benchmarks of image-to-text matching, text-to-image retrieval, and image generation.
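Since the abstract does not reproduce the exact generation rules, the snippet below is only a minimal, hypothetical sketch of what batch-time negation-caption generation could look like; the function name, templates, and distractor-noun source are illustrative assumptions, not the authors' pipeline.

```python
# Hypothetical sketch: create negation captions on the fly at batch time,
# avoiding an offline LLM generation pass. Not the actual TNG-CLIP code.
import random

NEGATION_TEMPLATES = [
    "{caption}, without any {distractor}",
    "{caption}, and there is no {distractor}",
]

def make_negation_captions(captions, distractor_nouns):
    """Attach a randomly chosen absent concept to each caption."""
    negated = []
    for cap in captions:
        distractor = random.choice(distractor_nouns)
        template = random.choice(NEGATION_TEMPLATES)
        negated.append(template.format(caption=cap, distractor=distractor))
    return negated

# Example: inside a training step, the negated captions would simply be
# appended to the text batch, which keeps the per-iteration overhead small.
batch_captions = ["a dog running on the beach", "a bowl of apples on a table"]
distractors = ["cat", "banana", "bicycle"]  # e.g. nouns mined from other captions in the batch
print(make_negation_captions(batch_captions, distractors))
```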
Related papers
- QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation [101.28446308930367]
Quantized Language-Image Pretraining (QLIP) combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding.
QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives.
We demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
arXiv Detail & Related papers (2025-02-07T18:59:57Z)
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
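As a rough illustration (not the authors' implementation), the triplet idea above can be written as a margin loss over (image, true caption, hard-negative caption) embeddings; the names and margin value below are assumptions.

```python
# Sketch of a triplet-style margin loss over CLIP embeddings; all inputs are
# assumed to be (B, D) and L2-normalized. Illustrative only.
import torch
import torch.nn.functional as F

def triplet_caption_loss(img_emb, pos_txt_emb, neg_txt_emb, margin=0.2):
    pos_sim = (img_emb * pos_txt_emb).sum(dim=-1)  # cosine similarity to the true caption
    neg_sim = (img_emb * neg_txt_emb).sum(dim=-1)  # similarity to the hard-negative caption
    return F.relu(margin - pos_sim + neg_sim).mean()
```

A symmetric term over synthetic negative images (text anchor, positive image, negative image) can be added in the same way.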
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations [43.484570564890866]
Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt.
We present CC-Neg, a dataset containing 228,246 images, true captions and their corresponding negated captions.
Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework, has an improved understanding of negations.
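A minimal sketch of one way such a modification could look, assuming the negated captions are appended as extra text candidates in the standard image-to-text contrastive loss; CoN-CLIP's actual objective may differ.

```python
# Sketch: negated captions act as additional (always incorrect) text candidates
# in a CLIP-style image-to-text contrastive loss. Not CoN-CLIP's exact loss.
import torch
import torch.nn.functional as F

def clip_loss_with_negated_captions(img_emb, txt_emb, neg_txt_emb, temperature=0.07):
    # img_emb, txt_emb, neg_txt_emb: (B, D), assumed L2-normalized.
    all_txt = torch.cat([txt_emb, neg_txt_emb], dim=0)   # (2B, D)
    logits = img_emb @ all_txt.t() / temperature         # (B, 2B)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, targets)
```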
arXiv Detail & Related papers (2024-03-29T17:33:42Z)
- DisCLIP: Open-Vocabulary Referring Expression Generation [37.789850573203694]
We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image.
We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene.
Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.
arXiv Detail & Related papers (2023-05-30T15:13:17Z)
- Teaching CLIP to Count to Ten [18.703050317383322]
We introduce a simple yet effective method to improve the quantitative understanding of large vision-language models (VLMs).
We propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective.
To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting.
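For intuition only, such a counting-contrastive term could contrast the correct-count caption of an image against captions with counterfactual counts, added alongside the original CLIP loss; the sketch below is a hypothetical formulation, not the paper's exact objective.

```python
# Rough sketch: the image must score its correct-count caption above K-1
# counterfactual-count captions, in tandem with the original CLIP objective.
import torch
import torch.nn.functional as F

def counting_contrastive_loss(img_emb, count_txt_emb, correct_idx, temperature=0.07):
    # img_emb: (B, D); count_txt_emb: (B, K, D) embeddings of K count-variant
    # captions per image; correct_idx: (B,) index of the true-count caption.
    logits = torch.einsum("bd,bkd->bk", img_emb, count_txt_emb) / temperature
    return F.cross_entropy(logits, correct_idx)

# total_loss = clip_loss + lambda_count * counting_contrastive_loss(...)
```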
arXiv Detail & Related papers (2023-02-23T14:43:53Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme for training the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It remains unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)