RECLIP: Resource-efficient CLIP by Training with Small Images
- URL: http://arxiv.org/abs/2304.06028v2
- Date: Thu, 31 Aug 2023 04:36:04 GMT
- Title: RECLIP: Resource-efficient CLIP by Training with Small Images
- Authors: Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo
- Abstract summary: We present RECLIP, a simple method that minimizes the computational resource footprint for CLIP (Contrastive Language Image Pretraining).
Inspired by the notion of coarse-to-fine in computer vision, we leverage small images to learn from large-scale language supervision efficiently.
- Score: 44.7490122024181
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present RECLIP (Resource-efficient CLIP), a simple method that minimizes
computational resource footprint for CLIP (Contrastive Language Image
Pretraining). Inspired by the notion of coarse-to-fine in computer vision, we
leverage small images to learn from large-scale language supervision
efficiently, and finetune the model with high-resolution data in the end. Since
the complexity of the vision transformer heavily depends on input image size,
our approach significantly reduces the training resource requirements both in
theory and in practice. Using the same batch size and training epoch, RECLIP
achieves highly competitive zero-shot classification and image-text retrieval
accuracy with 6 to 8x less computational resources and 7 to 9x fewer FLOPs than
the baseline. Compared to the state-of-the-art contrastive learning methods,
RECLIP demonstrates 5 to 59x training resource savings while maintaining highly
competitive zero-shot classification and retrieval performance. Finally, RECLIP
matches the state of the art in transfer learning to open-vocabulary detection
tasks, achieving 32 APr on LVIS. We hope this work will pave the path for the
broader research community to explore language supervised pretraining in
resource-friendly settings.
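The abstract's efficiency argument rests on the fact that vision transformer cost grows with the number of input patch tokens. The back-of-the-envelope sketch below is not the paper's code; the patch size, hidden width, and resolutions are illustrative assumptions. It only shows how the token count and rough per-layer FLOPs shrink when training images are small, which is what the coarse-to-fine schedule (small-image pretraining followed by brief high-resolution finetuning) exploits.

```python
# Rough sketch of why small-image training cuts ViT cost.
# Patch size, hidden width, and resolutions are illustrative assumptions,
# not the paper's exact configuration.

def vit_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image (plus one [CLS] token)."""
    return (image_size // patch_size) ** 2 + 1

def vit_layer_flops(n_tokens: int, width: int = 768) -> float:
    """Approximate FLOPs of one transformer layer:
    ~4*N*d^2 for the attention projections, ~2*N^2*d for attention itself,
    and ~8*N*d^2 for the MLP (expansion factor 4)."""
    attn_proj = 4 * n_tokens * width ** 2
    attn = 2 * n_tokens ** 2 * width
    mlp = 8 * n_tokens * width ** 2
    return attn_proj + attn + mlp

for res in (64, 112, 160, 224):
    n = vit_tokens(res)
    print(f"{res}px -> {n:4d} tokens, ~{vit_layer_flops(n) / 1e9:.2f} GFLOPs per layer")

# Token count grows quadratically with image side length, so pretraining at a
# small resolution and only finetuning at high resolution spends most of the
# training budget in the cheap regime.
```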
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose a vision token pruning and merging method ELIP, to remove less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with 30% of the vision tokens removed across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z)
- Improved baselines for vision-language pre-training [26.395527650984025]
We propose, implement and evaluate several baselines obtained by combining contrastive learning with self-supervised learning.
We find that these baselines outperform a basic implementation of CLIP, and that a simple CLIP baseline can itself be improved substantially, by up to a 25% relative improvement on downstream zero-shot tasks.
arXiv Detail & Related papers (2023-05-15T14:31:49Z)
- An Inverse Scaling Law for CLIP Training [24.961315762769893]
We present the finding that an inverse scaling law exists for CLIP training: the larger the image/text encoders used, the shorter the sequences of image/text tokens that can be applied during training.
This property allows CLIP to be trained successfully even with limited computational resources.
arXiv Detail & Related papers (2023-05-11T17:56:09Z)
- Scaling Language-Image Pre-training via Masking [63.36988191660858]
Fast Language-Image Pre-training (FLIP) is a simple and more efficient method for training CLIP that randomly masks out a large portion of image patches during training.
Masking allows us to learn from more image-text pairs given the same wall-clock time (a minimal masking sketch appears after this list).
FLIP dominantly outperforms CLIP counterparts trained on the same data.
arXiv Detail & Related papers (2022-12-01T18:59:57Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- ProtoCLIP: Prototypical Contrastive Language Image Pretraining [12.067061175987075]
Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance the grouping of representations learned by contrastive pretraining.
ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge.
ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data.
arXiv Detail & Related papers (2022-06-22T11:55:53Z)
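The "Scaling Language-Image Pre-training via Masking" entry above describes dropping image patches so that more image-text pairs fit into the same wall-clock budget. Below is a minimal sketch of such random patch masking in PyTorch; the function name, mask ratio, and tensor shapes are illustrative assumptions rather than FLIP's actual implementation.

```python
# Minimal sketch of FLIP-style random patch masking (illustrative only):
# a large fraction of image patch tokens is dropped before the vision encoder,
# so each training step is cheaper.
import torch

def random_mask_patches(patch_tokens: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """Keep a random subset of patch tokens per image.

    patch_tokens: (batch, num_patches, dim) tensor of embedded image patches.
    Returns a (batch, num_kept, dim) tensor with `mask_ratio` of the patches removed.
    """
    b, n, d = patch_tokens.shape
    num_keep = max(1, int(n * (1.0 - mask_ratio)))
    # Independent random ordering of patch indices for every image in the batch.
    scores = torch.rand(b, n, device=patch_tokens.device)
    keep_idx = scores.argsort(dim=1)[:, :num_keep]        # (b, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)   # (b, num_keep, d)
    return patch_tokens.gather(dim=1, index=keep_idx)

# Example: 196 patches (a 224px image with 16px patches), half of them dropped.
tokens = torch.randn(8, 196, 768)
visible = random_mask_patches(tokens, mask_ratio=0.5)
print(visible.shape)  # torch.Size([8, 98, 768])
```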