Teaching CLIP to Count to Ten
- URL: http://arxiv.org/abs/2302.12066v1
- Date: Thu, 23 Feb 2023 14:43:53 GMT
- Title: Teaching CLIP to Count to Ten
- Authors: Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal
  Irani and Tali Dekel
- Abstract summary: We introduce a simple yet effective method to improve the quantitative understanding of large vision-language models (VLMs).
We propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective.
To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting.
- Score: 18.703050317383322
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Large vision-language models (VLMs), such as CLIP, learn rich joint
image-text representations, facilitating advances in numerous downstream tasks,
including zero-shot classification and text-to-image generation. Nevertheless,
existing VLMs exhibit a prominent well-documented limitation - they fail to
encapsulate compositional concepts such as counting. We introduce a simple yet
effective method to improve the quantitative understanding of VLMs, while
maintaining their overall performance on common benchmarks. Specifically, we
propose a new counting-contrastive loss used to finetune a pre-trained VLM in
tandem with its original objective. Our counting loss is deployed over
automatically-created counterfactual examples, each consisting of an image and
a caption containing an incorrect object count. For example, an image depicting
three dogs is paired with the caption "Six dogs playing in the yard". Our loss
encourages discrimination between the correct caption and its counterfactual
variant which serves as a hard negative example. To the best of our knowledge,
this work is the first to extend CLIP's capabilities to object counting.
Furthermore, we introduce "CountBench" - a new image-text counting benchmark
for evaluating a model's understanding of object counting. We demonstrate a
significant improvement over state-of-the-art baseline models on this task.
Finally, we leverage our count-aware CLIP model for image retrieval and
text-conditioned image generation, demonstrating that our model can produce
specific counts of objects more reliably than existing ones.
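
The counterfactual captions and the counting-contrastive loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the number-word list (two through ten), and the two-way softmax form of the loss are assumptions made for illustration, and the similarity scores stand in for image-text similarities that a real setup would obtain from CLIP's encoders.

```python
import math
import random
import re

# Assumption: counts are written as number words from "two" to "ten".
NUMBER_WORDS = ["two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten"]
_NUMBER_RE = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)


def make_counterfactual(caption, rng):
    """Return `caption` with its first number word swapped for a different one.

    Returns None when the caption contains no number word from NUMBER_WORDS.
    (Hypothetical helper, not taken from the paper.)
    """
    match = _NUMBER_RE.search(caption)
    if match is None:
        return None
    original = match.group(0)
    candidates = [w for w in NUMBER_WORDS if w != original.lower()]
    replacement = rng.choice(candidates)
    if original[0].isupper():
        # Preserve sentence-initial capitalization.
        replacement = replacement.capitalize()
    return caption[:match.start()] + replacement + caption[match.end():]


def counting_loss(sim_true, sim_cf):
    """Two-way contrastive loss between the true and counterfactual caption.

    Computes -log(exp(sim_true) / (exp(sim_true) + exp(sim_cf))), which equals
    softplus(sim_cf - sim_true); the form below is numerically stable.
    """
    d = sim_cf - sim_true
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))


if __name__ == "__main__":
    rng = random.Random(0)
    caption = "Three dogs playing in the yard"
    print(make_counterfactual(caption, rng))
    # Placeholder similarity scores, not real CLIP outputs.
    print(counting_loss(sim_true=0.31, sim_cf=0.29))
```

The loss is small when the image is scored as more similar to the correct caption than to its counterfactual, and it grows as the counterfactual similarity overtakes the correct one. According to the abstract, this term is used in tandem with the model's original objective rather than replacing it.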
 
      
Related papers
- TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
 We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models.
Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features.
Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
 arXiv (2025-03-19T17:58:57Z)
- Learning Visual Composition through Improved Semantic Guidance [19.24813992815684]
 We show that by substantially improving weakly labeled data, we can vastly improve the performance of standard contrastive learning approaches.
We showcase our results on a relatively new captioning benchmark derived from DOCCI.
We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may demonstrate impressive performance on image retrieval tasks.
 arXiv (2024-12-19T20:58:26Z)
- Grounding Descriptions in Images informs Zero-Shot Visual Recognition [47.66166611138081]
 We propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously.
We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods.
 arXiv (2024-12-05T18:52:00Z)
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
 Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
 arXiv (2024-11-04T19:24:59Z)
- Diffusion Feedback Helps CLIP See Better [40.125318318373715]
 Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities.
However, CLIP has severe visual shortcomings: it can hardly distinguish orientation, quantity, color, or structure.
We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
 arXiv (2024-07-29T17:00:09Z)
- Zero-shot Object Counting with Good Exemplars [35.7544908318547]
 Zero-shot object counting (ZOC) aims to enumerate objects in images using only the names of object classes during testing, without the need for manual annotations.
We propose the Visual Association-based Zero-shot Object Counting (VA-Count) framework.
VA-Count consists of an Exemplar Enhancement Module (EEM) and a Noise Suppression Module (NSM) that synergistically refine the process of class exemplar identification while minimizing the consequences of incorrect object identification.
 arXiv (2024-07-06T03:37:22Z)
- CountCLIP -- [Re] Teaching CLIP to Count to Ten [0.0]
 This paper conducts a study of 'Teaching CLIP to Count to Ten'.
It presents a method to finetune a CLIP model to improve its zero-shot accuracy at counting objects in an image.
We improve the model's performance on a smaller subset of their training data with lower computational resources.
 arXiv (2024-06-05T19:05:08Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
 Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
 arXiv (2023-10-21T20:20:13Z)
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
 We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
 arXiv (2023-06-15T03:26:28Z)
- Text encoders bottleneck compositionality in contrastive vision-language models [76.2406963762722]
 We train text-only recovery probes that aim to reconstruct captions from single-vector text representations.
We find that CLIP's text encoder falls short on more compositional inputs.
Results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors.
 arXiv (2023-05-24T08:48:44Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
 Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
 arXiv (2022-10-31T13:42:21Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
 We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
 arXiv (2022-10-17T17:57:46Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
 We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
 arXiv (2022-04-11T07:16:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     