ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
- URL: http://arxiv.org/abs/2309.16738v2
- Date: Fri, 17 Nov 2023 06:38:30 GMT
- Title: ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
- Authors: Yangyang Guo and Haoyu Zhang and Yongkang Wong and Liqiang Nie and
Mohan Kankanhalli
- Abstract summary: We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with roughly 30% of vision tokens removed across 12 ViT layers, ELIP maintains performance comparable to the baselines.
- Score: 75.09406436851445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning a versatile language-image model is computationally prohibitive
under a limited computing budget. This paper delves into the \emph{efficient
language-image pre-training}, an area that has received relatively little
attention despite its importance in reducing computational cost and footprint.
To that end, we propose ELIP, a vision token pruning and merging method that
removes less influential tokens based on the supervision of language outputs.
Our method is designed with several strengths, such as being
computation-efficient, memory-efficient, and trainable-parameter-free, and is
distinguished from previous vision-only token pruning approaches by its
alignment with task objectives. We implement this method in a progressively
pruning manner using several sequential blocks. To evaluate its generalization
performance, we apply ELIP to three commonly used language-image pre-training
models and utilize public image-caption pairs with 4M images for pre-training.
Our experiments demonstrate that with $\sim$30\% of vision tokens removed
across 12 ViT layers, ELIP maintains performance comparable to the baselines
($\sim$0.32 accuracy drop on average) over various downstream tasks
including cross-modal retrieval, VQA, image captioning, \emph{etc}. In
addition, the GPU resources spared by ELIP allow us to scale up to larger batch
sizes, thereby accelerating model pre-training and sometimes even enhancing
downstream model performance.
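To make the pruning-and-merging idea above concrete, below is a minimal, illustrative sketch of a text-conditioned "score, keep top-k, merge the rest" step over vision tokens. The function name, the text-similarity scoring proxy, and the keep ratio are assumptions for illustration only; they are not the paper's implementation, which derives token importance from the supervision of language outputs.

```python
# Illustrative sketch only: a text-conditioned "score, keep top-k, merge the
# rest" step for vision tokens. The scoring proxy (similarity to text tokens),
# the keep ratio, and all names are assumptions, not the paper's method.
import torch


def prune_and_merge(vision_tokens: torch.Tensor,
                    text_tokens: torch.Tensor,
                    keep_ratio: float = 0.7) -> torch.Tensor:
    """Keep the top-k vision tokens by a text-conditioned score and merge the
    pruned tokens into one extra token.

    vision_tokens: (B, Nv, D); text_tokens: (B, Nt, D); returns (B, k + 1, D).
    """
    B, Nv, D = vision_tokens.shape
    k = max(1, int(Nv * keep_ratio))

    # Hypothetical importance score: mean scaled similarity to the text tokens.
    # (ELIP derives importance from language outputs; this proxy only keeps
    # the sketch self-contained.)
    sim = torch.einsum('bvd,btd->bvt', vision_tokens, text_tokens) / D ** 0.5
    scores = sim.softmax(dim=1).mean(dim=2)                      # (B, Nv)

    keep_mask = torch.zeros(B, Nv, dtype=torch.bool, device=vision_tokens.device)
    keep_mask.scatter_(1, scores.topk(k, dim=1).indices, True)

    kept = vision_tokens[keep_mask].view(B, k, D)

    # Merge the pruned tokens into a single score-weighted token so their
    # information is aggregated rather than discarded outright.
    pruned = vision_tokens[~keep_mask].view(B, Nv - k, D)
    w = scores[~keep_mask].view(B, Nv - k, 1)
    merged = (w * pruned).sum(1, keepdim=True) / w.sum(1, keepdim=True).clamp_min(1e-6)

    return torch.cat([kept, merged], dim=1)


# Hypothetical progressive schedule across sequential ViT blocks:
#   x = block_1(x); x = prune_and_merge(x, t, 0.9)
#   x = block_2(x); x = prune_and_merge(x, t, 0.9)
```

Applying such a step before several sequential blocks mirrors, in spirit, the progressive pruning described in the abstract, with the vision token count shrinking stage by stage.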
Related papers
- Enhancing Vision-Language Model with Unmasked Token Alignment [37.12838142681491]
This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance its vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder.
arXiv Detail & Related papers (2024-05-29T11:48:17Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data hunger of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Masked Autoencoding Does Not Help Natural Language Supervision at Scale [16.277390808400828]
We investigate whether a similar approach can be effective when trained with a much larger amount of data.
We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs.
arXiv Detail & Related papers (2023-01-19T01:05:18Z)
- Effective End-to-End Vision Language Pretraining with Semantic Visual Loss [58.642954383282216]
Current vision language pretraining models are dominated by methods using region visual features extracted from object detectors.
We introduce three types of visual losses that enable much faster convergence and better finetuning accuracy.
Compared with region feature models, our end-to-end models could achieve similar or better performance on downstream tasks and run more than 10 times faster during inference.
arXiv Detail & Related papers (2023-01-18T00:22:49Z)
- Scaling Language-Image Pre-training via Masking [63.36988191660858]
Fast Language-Image Pre-training (FLIP) is a simple and more efficient method for training CLIP.
Masking allows us to learn from more image-text pairs given the same wall-clock time.
FLIP dominantly outperforms CLIP counterparts trained on the same data.
arXiv Detail & Related papers (2022-12-01T18:59:57Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training)
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
- ProtoCLIP: Prototypical Contrastive Language Image Pretraining [12.067061175987075]
Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance such grouping.
ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge.
ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data.
arXiv Detail & Related papers (2022-06-22T11:55:53Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model; a generic sketch of this contrastive-plus-distillation recipe appears after this list.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
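As a companion to the ALBEF entry above, here is a generic, hedged sketch of an image-text contrastive loss whose one-hot targets are softened by pseudo-targets from momentum (EMA) encoders. The function name, the temperature, and the mixing weight alpha are illustrative assumptions rather than ALBEF's actual code.

```python
# Generic sketch (not ALBEF's code): an image-text contrastive loss whose
# one-hot targets are mixed with pseudo-targets from momentum (EMA) encoders.
# The temperature and the mixing weight `alpha` are assumed values.
import torch
import torch.nn.functional as F


def itc_with_momentum_targets(img, txt, img_m, txt_m,
                              temperature: float = 0.07,
                              alpha: float = 0.4) -> torch.Tensor:
    """img/txt: (B, D) online-encoder embeddings; img_m/txt_m: (B, D) EMA ones."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    img_m, txt_m = F.normalize(img_m, dim=-1), F.normalize(txt_m, dim=-1)

    logits_i2t = img @ txt.t() / temperature                 # (B, B)
    logits_t2i = txt @ img.t() / temperature

    with torch.no_grad():
        # Soft pseudo-targets from the momentum encoders, mixed with one-hot labels.
        one_hot = torch.eye(img.size(0), device=img.device)
        tgt_i2t = alpha * (img_m @ txt_m.t() / temperature).softmax(1) + (1 - alpha) * one_hot
        tgt_t2i = alpha * (txt_m @ img_m.t() / temperature).softmax(1) + (1 - alpha) * one_hot

    loss_i2t = -(tgt_i2t * F.log_softmax(logits_i2t, dim=1)).sum(1).mean()
    loss_t2i = -(tgt_t2i * F.log_softmax(logits_t2i, dim=1)).sum(1).mean()
    return (loss_i2t + loss_t2i) / 2
```

The momentum encoders are assumed to be EMA copies of the online encoders, updated outside this function; mixing their similarity distribution into the targets is the kind of self-training signal the summary refers to.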