GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language
Pre-training
- URL: http://arxiv.org/abs/2208.04060v1
- Date: Mon, 8 Aug 2022 11:15:45 GMT
- Title: GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language
Pre-training
- Authors: Jaeseok Byun, Taebaek Hwang, Jianlong Fu, and Taesup Moon
- Abstract summary: We show that two routinely applied steps during pre-training have a crucial impact on the performance of the pre-trained model.
We propose a new vision and language pre-training method, which adaptively samples mini-batches for more effective mining of hard negative samples for ITM.
Our method achieves a new state-of-the-art performance on various downstream tasks with much less computational cost.
- Score: 47.95914618851596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most of the currently existing vision and language pre-training (VLP) methods
have mainly focused on how to extract and align vision and text features. In
contrast to the mainstream VLP methods, we highlight that two routinely applied
steps during pre-training have a crucial impact on the performance of the
pre-trained model: in-batch hard negative sampling for image-text matching
(ITM) and assigning a large masking probability for masked language modeling
(MLM). After empirically showing the unexpected effectiveness of the above two
steps, we systematically devise our GRIT-VLP, which adaptively samples
mini-batches for more effective mining of hard negative samples for ITM while
maintaining the computational cost for pre-training. Our method consists of
three components: 1) a GRouped mIni-baTch sampling (GRIT) strategy that
collects similar examples in a mini-batch, 2) an ITC consistency loss for
improving the mining ability, and 3) an enlarged masking probability for MLM.
Consequently, we show that our GRIT-VLP achieves new state-of-the-art
performance on various
downstream tasks with much less computational cost. Furthermore, we
demonstrate that our model is essentially on par with ALBEF, the previous
state-of-the-art, with only one-third of the training epochs on the same
training data. Code is
available at https://github.com/jaeseokbyun/GRIT-VLP.
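The abstract names, but does not detail, the two sampling-related components: grouping similar examples into the same mini-batch and mining in-batch hard negatives for ITM. The following is a minimal sketch of that idea under stated assumptions; the greedy grouping heuristic and the names `grouped_minibatch_indices` and `sample_hard_negatives` are illustrative, not the authors' exact algorithm (which is in the linked repository).

```python
# Minimal sketch (not the official implementation): group examples by
# image-text contrastive (ITC) similarity so that each mini-batch contains
# semantically similar pairs, then mine in-batch hard negatives for the
# image-text matching (ITM) head.
import torch


def grouped_minibatch_indices(image_emb, text_emb, batch_size):
    """Greedily group indices so that each mini-batch holds similar examples.

    image_emb, text_emb: (N, d) L2-normalized ITC embeddings for the pool of
    examples to be re-batched; batch_size: examples per mini-batch.
    """
    sim = image_emb @ text_emb.t()          # (N, N) cross-modal similarity
    remaining = set(range(image_emb.size(0)))
    batches = []
    while remaining:
        anchor = remaining.pop()
        # add the most similar remaining examples to the anchor's group
        order = torch.argsort(sim[anchor], descending=True).tolist()
        group = [anchor]
        for j in order:
            if len(group) == batch_size:
                break
            if j in remaining:
                remaining.remove(j)
                group.append(j)
        batches.append(group)
    return batches


def sample_hard_negatives(sim_i2t):
    """For each image, pick the hardest non-matching text in the batch.

    sim_i2t: (B, B) in-batch image-to-text similarity; the diagonal holds the
    positive pairs, so it is masked out before taking the argmax.
    """
    masked = sim_i2t - torch.eye(sim_i2t.size(0), device=sim_i2t.device) * 1e4
    return masked.argmax(dim=1)             # index of the hard negative text
```

Because each mini-batch is built from mutually similar pairs, the non-matching texts available in-batch are already semantically close to each image, so the negatives fed to the ITM head are harder without searching a larger candidate pool or adding extra forward passes.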
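The third component, an enlarged masking probability for MLM, only changes how inputs are corrupted before computing the MLM loss. A minimal, generic sketch is below; the helper name `mask_tokens` and the 0.5 probability are illustrative assumptions (BERT-style pre-training typically masks 15% of tokens, and the paper argues for a substantially larger probability).

```python
# Minimal sketch of MLM input corruption with an enlarged masking probability.
import torch


def mask_tokens(input_ids, mask_token_id, vocab_size, special_mask, p_mask=0.5):
    """Return (corrupted_ids, labels) for masked language modeling.

    input_ids: (B, L) token ids; special_mask: (B, L) bool, True where tokens
    (e.g. [CLS], [SEP], [PAD]) must never be masked.
    """
    labels = input_ids.clone()
    probs = torch.full_like(input_ids, p_mask, dtype=torch.float)
    probs.masked_fill_(special_mask, 0.0)
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                  # ignore unmasked positions in the loss

    # 80% of masked positions -> [MASK], 10% -> random token, 10% -> unchanged
    corrupted = input_ids.clone()
    replace = torch.bernoulli(torch.full_like(probs, 0.8)).bool() & masked
    corrupted[replace] = mask_token_id
    random_tok = torch.bernoulli(torch.full_like(probs, 0.5)).bool() & masked & ~replace
    rand_ids = torch.randint(vocab_size, input_ids.shape, device=input_ids.device)
    corrupted[random_tok] = rand_ids[random_tok]
    return corrupted, labels
```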
Related papers
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z) - Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z) - Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z) - TiMix: Text-aware Image Mixing for Effective Vision-Language
Pre-training [42.142924806184425]
Mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss.
TiMix exhibits comparable performance on downstream tasks, even with less training data and shorter training time, when benchmarked against existing methods.
arXiv Detail & Related papers (2023-12-14T12:02:24Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - Tuning Language Models as Training Data Generators for
Augmentation-Enhanced Few-Shot Learning [30.65315081964461]
We study few-shot learning with pretrained language models (PLMs) from a different perspective.
We first tune an autoregressive PLM on the few-shot samples and then use it as a generator to synthesize a large amount of novel training samples.
Our approach, FewGen, achieves overall better results than existing few-shot learning methods across seven classification tasks of the GLUE benchmark.
arXiv Detail & Related papers (2022-11-06T06:46:47Z) - Learning New Tasks from a Few Examples with Soft-Label Prototypes [18.363177410917597]
We propose a novel few-shot learning approach based on soft-label prototypes (SLPs).
We focus on learning previously unseen NLP tasks from very few examples (4, 8, 16) per class.
We experimentally demonstrate that our approach achieves superior performance on the majority of tested tasks in this data-lean setting.
arXiv Detail & Related papers (2022-10-31T16:06:48Z) - Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language
Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z) - Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training [27.103514548337404]
Existing approaches to vision-language pre-training rely on an object detector based on bounding boxes (regions).
In this paper, we revisit grid-based convolutional features for vision-language pre-training, skipping the expensive region-related steps.
arXiv Detail & Related papers (2021-08-21T09:57:21Z)