Attentive Mask CLIP
- URL: http://arxiv.org/abs/2212.08653v1
- Date: Fri, 16 Dec 2022 18:59:12 GMT
- Title: Attentive Mask CLIP
- Authors: Yifan Yang, Weiquan Huang, Yixuan Wei, Houwen Peng, Xinyang Jiang,
Huiqiang Jiang, Fangyun Wei, Yin Wang, Han Hu, Lili Qiu, Yuqing Yang
- Abstract summary: We propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description.
Our approach achieves $43.9\%$ top-1 accuracy on ImageNet-1K zero-shot classification, as well as $62.7/42.1$ and $38.0/23.2$ I2T/T2I retrieval accuracy on Flickr30K and MS COCO.
- Score: 48.206857783966996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image token removal is an efficient augmentation strategy for reducing the
cost of computing image features. However, this efficient augmentation strategy
has been found to adversely affect the accuracy of CLIP-based training. We
hypothesize that removing a large portion of image tokens may improperly
discard the semantic content associated with a given text description, thus
constituting an incorrect pairing target in CLIP training. To address this
issue, we propose an attentive token removal approach for CLIP training, which
retains tokens with a high semantic correlation to the text description. The
correlation scores are computed in an online fashion using the EMA version of
the visual encoder. Our experiments show that the proposed attentive masking
approach performs better than the previous method of random token removal for
CLIP training. The approach also makes it efficient to apply multiple
augmentation views to the image and to introduce instance contrastive
learning tasks between these views into the CLIP framework. Compared to other
CLIP improvements that combine different pre-training targets such as SLIP and
MaskCLIP, our method is not only more effective, but also much more efficient.
Specifically, using ViT-B and YFCC-15M dataset, our approach achieves $43.9\%$
top-1 accuracy on ImageNet-1K zero-shot classification, as well as $62.7/42.1$
and $38.0/23.2$ I2T/T2I retrieval accuracy on Flickr30K and MS COCO, which are
$+1.1\%$, $+5.5/+0.9$, and $+4.4/+1.3$ higher than the SLIP method, while being
$2.30\times$ faster. An efficient version of our approach running $1.16\times$
faster than the plain CLIP model achieves significant gains of $+5.3\%$,
$+11.3/+8.0$, and $+9.5/+4.9$ on these benchmarks.
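The central step described above, keeping only the image tokens that are semantically relevant to the paired text before they reach the trainable visual encoder, can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration rather than the authors' released implementation: it assumes an EMA copy of the ViT that exposes the final-layer attention of the [CLS] token over patch tokens (`ema_vit.cls_patch_attention`), a visual encoder that accepts a `keep_indices` argument, and a `keep_ratio` hyperparameter; all of these names are placeholders standing in for the correlation scoring the abstract describes.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def attentive_keep_indices(ema_vit, images, keep_ratio=0.5):
    """Score patch tokens with the EMA visual encoder and keep the top fraction.

    Assumption: `ema_vit.cls_patch_attention(images)` returns the final-layer
    attention of the [CLS] token over the N patch tokens, shape (B, N),
    used here as a proxy for text-relevance scores.
    """
    scores = ema_vit.cls_patch_attention(images)            # (B, N) relevance per patch
    num_keep = max(1, int(scores.shape[1] * keep_ratio))
    return scores.topk(num_keep, dim=1).indices             # indices of the retained patches

def clip_loss_with_attentive_mask(visual_encoder, text_encoder, ema_vit,
                                  images, texts, temperature=0.07):
    # Select informative patches with the frozen EMA encoder, then encode
    # only those patches with the trainable visual encoder (hypothetical API).
    keep_idx = attentive_keep_indices(ema_vit, images)
    img_feat = F.normalize(visual_encoder(images, keep_indices=keep_idx), dim=-1)  # (B, D)
    txt_feat = F.normalize(text_encoder(texts), dim=-1)                            # (B, D)

    logits = img_feat @ txt_feat.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(images.shape[0], device=logits.device)
    # Symmetric InfoNCE: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

@torch.no_grad()
def update_ema(ema_vit, visual_encoder, momentum=0.999):
    # The EMA encoder tracks the trainable encoder and is never updated by gradients.
    for p_ema, p in zip(ema_vit.parameters(), visual_encoder.parameters()):
        p_ema.mul_(momentum).add_(p, alpha=1.0 - momentum)
```

Because each masked view contains only a fraction of the patches, several such views can be encoded for roughly the cost of one full image, which is what makes adding instance contrastive terms between views inexpensive, as the abstract notes.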
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experimental results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z) - Learning Mask-aware CLIP Representations for Zero-Shot Segmentation [120.97144647340588]
Image-Proposals CLIP (IP-CLIP) is proposed to handle arbitrary numbers of image and mask proposals simultaneously.
A mask-aware loss and a self-distillation loss are designed to fine-tune IP-CLIP, ensuring CLIP is responsive to different mask proposals.
We conduct extensive experiments on the popular zero-shot benchmarks.
arXiv Detail & Related papers (2023-09-30T03:27:31Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with the removal of 30% of vision tokens across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- CLIP-KD: An Empirical Study of CLIP Model Distillation [24.52910358842176]
This paper aims to distill small CLIP models supervised by a large teacher CLIP model.
We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well.
Interactive contrastive learning across teacher and student encoders also improves performance (a minimal sketch of the feature-mimicry idea follows this entry).
arXiv Detail & Related papers (2023-07-24T12:24:07Z)
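The feature-mimicry idea named in the CLIP-KD summary can be written in a few lines. The snippet below is a hypothetical sketch, not the CLIP-KD reference code: the linear projection to align student and teacher widths and the feature normalization are common distillation choices assumed here, and may differ from the paper's exact recipe.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureMimicry(nn.Module):
    """MSE between student embeddings and frozen teacher CLIP embeddings."""

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # Project student features to the teacher's width (assumes the dims differ).
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feat, teacher_feat):
        student_feat = F.normalize(self.proj(student_feat), dim=-1)
        teacher_feat = F.normalize(teacher_feat, dim=-1).detach()  # teacher is frozen
        return F.mse_loss(student_feat, teacher_feat)
```

In practice such a term would be added to the student's usual CLIP contrastive loss with a weighting coefficient.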
- An Inverse Scaling Law for CLIP Training [24.961315762769893]
We present a finding that there exists an inverse scaling law for CLIP training.
We are able to successfully train CLIP even with limited computational resources.
arXiv Detail & Related papers (2023-05-11T17:56:09Z)
- Masked Autoencoding Does Not Help Natural Language Supervision at Scale [16.277390808400828]
We investigate whether a similar approach can be effective when trained with a much larger amount of data.
We find that a combination of two state-of-the-art approaches, masked autoencoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP alone when trained on a corpus of 11.3M image-text pairs.
arXiv Detail & Related papers (2023-01-19T01:05:18Z)
- ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [35.60888272729273]
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme.
While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost.
We propose a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from image to pixel level.
arXiv Detail & Related papers (2022-12-07T12:05:00Z)
- Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
arXiv Detail & Related papers (2020-11-18T08:42:32Z)