Scaling Language-Image Pre-training via Masking
- URL: http://arxiv.org/abs/2212.00794v2
- Date: Thu, 30 Mar 2023 05:04:28 GMT
- Title: Scaling Language-Image Pre-training via Masking
- Authors: Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He
- Abstract summary: Fast Language-Image Pre-training (FLIP) is a simple and more efficient method for training CLIP.
Masking allows us to learn from more image-text pairs given the same wall-clock time.
FLIP dominantly outperforms CLIP counterparts trained on the same data.
- Score: 63.36988191660858
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Fast Language-Image Pre-training (FLIP), a simple and more
efficient method for training CLIP. Our method randomly masks out and removes a
large portion of image patches during training. Masking allows us to learn from
more image-text pairs given the same wall-clock time and contrast more samples
per iteration with similar memory footprint. It leads to a favorable trade-off
between accuracy and training time. In our experiments on 400 million
image-text pairs, FLIP improves both accuracy and speed over the no-masking
baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms
the CLIP counterparts trained on the same data. Facilitated by the speedup, we
explore the scaling behavior of increasing the model size, data size, or
training length, and report encouraging results and comparisons. We hope that
our work will foster future research on scaling vision-language learning.
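The efficiency argument above is easy to see in code: if a large fraction of patches is dropped before the image encoder, the per-image encoding cost falls roughly in proportion, so more image-text pairs (or larger contrastive batches) fit into the same wall-clock and memory budget. The sketch below is a minimal illustration assuming generic PyTorch placeholder encoders (image_encoder, text_encoder) that map inputs to (batch, dim) embeddings; it is not the authors' released implementation.

```python
# Minimal FLIP-style training step: randomly drop a large fraction of image
# patches, encode only the visible ones, then apply the standard CLIP-style
# symmetric contrastive (InfoNCE) loss. Encoders are placeholder assumptions.
import torch
import torch.nn.functional as F

def random_patch_mask(patches: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """patches: (B, N, D) patch embeddings; keep a random (1 - mask_ratio) subset."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)        # i.i.d. score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]             # indices of kept patches
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)
    return torch.gather(patches, 1, keep_idx)               # (B, n_keep, D)

def flip_contrastive_loss(image_encoder, text_encoder, patches, text_tokens,
                          mask_ratio=0.5, temperature=0.07):
    visible = random_patch_mask(patches, mask_ratio)         # encode visible patches only
    img_emb = F.normalize(image_encoder(visible), dim=-1)    # (B, D)
    txt_emb = F.normalize(text_encoder(text_tokens), dim=-1) # (B, D)
    logits = img_emb @ txt_emb.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Symmetric InfoNCE: image-to-text plus text-to-image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

With mask_ratio=0.5, the image encoder processes only half the patches per image, which is the trade-off the abstract describes: more pairs seen, or more samples contrasted per iteration, at a similar memory footprint.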
Related papers
- Enhancing Vision-Language Model Pre-training with Image-text Pair Pruning Based on Word Frequency [0.0]
We propose Word-Frequency-based Image-Text Pair Pruning (WFPP), a novel data pruning method.
WFPP prunes text-image pairs containing high-frequency words across the entire training dataset.
Our experiments demonstrate that applying WFPP when training a CLIP model improves performance on a wide range of downstream tasks.
arXiv Detail & Related papers (2024-10-09T11:54:41Z)
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning [78.19528555505961]
We propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data.
The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation.
Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets, but can also leverage interleaved pre-training data.
arXiv Detail & Related papers (2024-06-11T17:59:35Z)
- Efficient Vision-Language Pre-training by Cluster Masking [13.845233914223561]
We propose a simple strategy for masking image patches during visual-language contrastive learning.
We randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities.
This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context.
arXiv Detail & Related papers (2024-05-14T17:59:40Z)
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data [40.88256210436378]
We present a novel weakly supervised pre-training method for vision models on web-scale image-text data.
The proposed method reframes pre-training on image-text data as a classification task.
It achieves a remarkable 2.7x acceleration in training speed compared to contrastive learning on web-scale data.
arXiv Detail & Related papers (2024-04-24T05:13:28Z)
- Centered Masking for Language-Image Pre-Training [0.0]
We introduce Gaussian masking for Language-Image Pre-Training (GLIP).
GLIP is a novel, straightforward, and effective technique for masking image patches during pre-training of a vision-language model; a toy sketch of center-weighted patch selection appears after this list.
arXiv Detail & Related papers (2024-03-23T13:24:31Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with 30% of vision tokens removed across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- ProtoCLIP: Prototypical Contrastive Language Image Pretraining [12.067061175987075]
Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance the grouping of image and text representations.
ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge.
ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data.
arXiv Detail & Related papers (2022-06-22T11:55:53Z)
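As a companion to the masking sketch above, the Centered Masking (GLIP) entry in this list suggests biasing which patches are kept toward the image center rather than choosing them uniformly at random. The weighting below (a Gaussian over the patch grid with an assumed sigma) is an illustrative reading of that idea, not the paper's exact formulation.

```python
# Illustrative center-weighted patch selection (one plausible reading of
# GLIP-style "Gaussian masking"); sigma and the scoring rule are assumptions.
import torch

def centered_keep_indices(batch: int, n_side: int,
                          mask_ratio: float = 0.5, sigma: float = 0.35) -> torch.Tensor:
    """Indices of kept patches on an n_side x n_side grid, biased toward the center."""
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, n_side),
                            torch.linspace(-1, 1, n_side), indexing="ij")
    weight = torch.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2)).flatten()  # (N,)
    n_keep = int(n_side * n_side * (1 - mask_ratio))
    scores = torch.rand(batch, n_side * n_side) * weight     # noisy, center-biased scores
    return scores.argsort(dim=1, descending=True)[:, :n_keep]  # (B, n_keep)
```

Swapping this selection rule into random_patch_mask above, with everything else unchanged, is one way to compare uniform versus center-biased masking under the same FLIP-style training loop.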