Masked Autoencoding Does Not Help Natural Language Supervision at Scale
- URL: http://arxiv.org/abs/2301.07836v4
- Date: Mon, 15 May 2023 17:05:32 GMT
- Title: Masked Autoencoding Does Not Help Natural Language Supervision at Scale
- Authors: Floris Weers, Vaishaal Shankar, Angelos Katharopoulos, Yinfei Yang,
Tom Gunter
- Abstract summary: We investigate whether a similar approach can be effective when trained with a much larger amount of data.
We find that a combination of two state-of-the-art approaches, masked autoencoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit when trained on a corpus of 1.4B images.
- Score: 16.277390808400828
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervision and natural language supervision have emerged as two
exciting ways to train general-purpose image encoders that excel at a variety
of downstream tasks. Recent works such as M3AE and SLIP have suggested that
these approaches can be effectively combined, but most notably their results
use small pre-training datasets (<50M samples) and don't effectively reflect
the large-scale regime (>100M examples) that is commonly used for these
approaches. Here we investigate whether a similar approach can be effective
when trained with a much larger amount of data. We find that a combination of
two state-of-the-art approaches, masked autoencoders (MAE) and contrastive
language-image pre-training (CLIP), provides a benefit over CLIP when trained
on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated
on a suite of common vision tasks) over CLIP when trained on a large corpus of
1.4B images. Our work provides some much-needed clarity into the effectiveness
(or lack thereof) of self-supervision for large-scale image-text training.
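To make the combined objective concrete, below is a minimal sketch of a joint training loss that pairs a CLIP-style symmetric contrastive term with an MAE-style masked-patch reconstruction term. This is not the authors' implementation; the function names, tensor shapes, and the `mae_weight` hyperparameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE loss over a batch of paired image/text embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def mae_reconstruction_loss(pred_patches, target_patches, mask):
    # Mean squared error computed only on the masked patches (MAE-style);
    # mask is a [batch, num_patches] float tensor with 1 at masked positions.
    per_patch = ((pred_patches - target_patches) ** 2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum()


def combined_loss(image_emb, text_emb, pred_patches, target_patches, mask,
                  mae_weight=1.0):
    # Hypothetical joint objective: CLIP contrastive term plus a weighted
    # MAE reconstruction term; mae_weight is an assumed hyperparameter.
    return (clip_contrastive_loss(image_emb, text_emb)
            + mae_weight * mae_reconstruction_loss(pred_patches,
                                                   target_patches, mask))
```

In this sketch the reconstruction term is evaluated only on masked patches, mirroring MAE, while the contrastive term uses the pooled image and text embeddings as in CLIP; how the two terms are balanced and batched is a design choice not specified here.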
Related papers
- Enhancing Vision-Language Model with Unmasked Token Alignment [37.12838142681491]
This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder.
arXiv Detail & Related papers (2024-05-29T11:48:17Z)
- CLIP with Quality Captions: A Strong Pretraining for Vision Tasks [16.208506912410147]
We show that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods.
We find that mobile architectures also benefit significantly from CLIP pretraining.
arXiv Detail & Related papers (2024-05-14T19:06:24Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with 30% of vision tokens removed across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data-hungry demands of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Scaling Language-Image Pre-training via Masking [63.36988191660858]
Fast Language-Image Pre-training (FLIP) is a simple and more efficient method for training CLIP.
Masking allows us to learn from more image-text pairs given the same wall-clock time.
FLIP largely outperforms its CLIP counterparts trained on the same data.
arXiv Detail & Related papers (2022-12-01T18:59:57Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)