Less is More: Removing Text-regions Improves CLIP Training Efficiency
and Robustness
- URL: http://arxiv.org/abs/2305.05095v1
- Date: Mon, 8 May 2023 23:47:07 GMT
- Title: Less is More: Removing Text-regions Improves CLIP Training Efficiency
and Robustness
- Authors: Liangliang Cao, Bowen Zhang, Chen Chen, Yinfei Yang, Xianzhi Du,
Wencong Zhang, Zhiyun Lu, Yantao Zheng
- Abstract summary: The CLIP (Contrastive Language-Image Pre-training) model and its variants are becoming the de facto backbone in many applications.
We discuss two effective approaches to improve the efficiency and robustness of CLIP training.
Our filter-based CLIP model demonstrates a top-1 accuracy of 68.78%, outperforming previous models whose accuracy was all below 50%.
- Score: 19.77762574325687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The CLIP (Contrastive Language-Image Pre-training) model and its variants are
becoming the de facto backbone in many applications. However, training a CLIP
model from hundreds of millions of image-text pairs can be prohibitively
expensive. Furthermore, the conventional CLIP model doesn't differentiate
between the visual semantics and meaning of text regions embedded in images.
This can lead to non-robustness when the text in the embedded region doesn't
match the image's visual appearance. In this paper, we discuss two effective
approaches to improve the efficiency and robustness of CLIP training: (1)
augmenting the training dataset while maintaining the same number of
optimization steps, and (2) filtering out samples that contain text regions in
the image. By doing so, we significantly improve the classification and
retrieval accuracy on public benchmarks like ImageNet and CoCo. Filtering out
images with text regions also protects the model from typographic attacks. To
verify this, we build a new dataset named ImageNet with Adversarial Text
Regions (ImageNet-Attr). Our filter-based CLIP model demonstrates a top-1
accuracy of 68.78\%, outperforming previous models whose accuracy was all below
50\%.
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating hard'' negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z) - ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free
Domain Adaptation [20.57370550156505]
ReCLIP is a source-free domain adaptation method for vision-language models.
We demonstrate ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks.
arXiv Detail & Related papers (2023-08-04T18:11:40Z) - Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z) - Turning a CLIP Model into a Scene Text Detector [56.86413150091367]
Recently, pretraining approaches based on vision language models have made effective progresses in the field of text detection.
This paper proposes a new method, termed TCM, focusing on Turning the CLIP Model directly for text detection without pretraining process.
arXiv Detail & Related papers (2023-02-28T06:06:12Z) - STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense presentations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model with +$4.9%$ and +$4.3%$ absolute Recall@1 improvement.
arXiv Detail & Related papers (2023-01-30T17:21:30Z) - Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP [45.81698881151867]
Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training.
Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions.
We propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions.
In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-
arXiv Detail & Related papers (2022-10-09T02:57:32Z) - CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language
Representation Alignment [146.3128011522151]
We propose a Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z) - RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.