Effective End-to-End Vision Language Pretraining with Semantic Visual Loss
- URL: http://arxiv.org/abs/2301.07236v1
- Date: Wed, 18 Jan 2023 00:22:49 GMT
- Title: Effective End-to-End Vision Language Pretraining with Semantic Visual Loss
- Authors: Xiaofeng Yang, Fayao Liu, Guosheng Lin
- Abstract summary: Current vision language pretraining models are dominated by methods using region visual features extracted from object detectors.
We introduce three types of visual losses that enable much faster convergence and better finetuning accuracy.
Compared with region feature models, our end-to-end models could achieve similar or better performance on downstream tasks and run more than 10 times faster during inference.
- Score: 58.642954383282216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current vision language pretraining models are dominated by methods using
region visual features extracted from object detectors. Despite their good
performance, the extract-then-process pipeline significantly restricts
inference speed and therefore limits their real-world use cases. However,
training vision language models from raw image pixels is difficult, as the raw
image pixels give much less prior knowledge than region features. In this
paper, we systematically study how to leverage auxiliary visual pretraining
tasks to help training end-to-end vision language models. We introduce three
types of visual losses that enable much faster convergence and better
finetuning accuracy. Compared with region feature models, our end-to-end models
could achieve similar or better performance on downstream tasks and run more
than 10 times faster during inference. Compared with other end-to-end models,
our proposed method could achieve similar or better performance when pretrained
for only 10% of the pretraining GPU hours.
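Note: the abstract does not spell out the three visual losses, so the Python sketch below only illustrates one plausible form of a semantic visual loss: grid features from the raw-pixel backbone are projected into a text embedding space and classified against word embeddings of (pseudo) object labels, giving pixel features the kind of semantic prior that region features inherit from an object detector. The class name, tensor shapes, temperature, and the use of pseudo-labels from a frozen tagger are assumptions made for illustration, not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticVisualLoss(nn.Module):
    """Illustrative semantic visual loss (assumed design, not the paper's exact one).

    Each visual grid feature is pushed toward the word embedding of a pseudo
    object label, so that raw-pixel features carry semantics similar to
    detector-based region features.
    """

    def __init__(self, feat_dim, embed_dim, label_embeddings, temperature=0.05):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)  # map visual features into the text embedding space
        self.temperature = temperature
        # (num_labels, embed_dim) word embeddings of the label vocabulary, L2-normalized
        self.register_buffer("label_emb", F.normalize(label_embeddings, dim=-1))

    def forward(self, grid_feats, pseudo_labels):
        # grid_feats: (B, N, feat_dim) patch/grid features from the image backbone
        # pseudo_labels: (B, N) integer label ids, e.g. produced by a frozen tagger
        v = F.normalize(self.proj(grid_feats), dim=-1)       # (B, N, embed_dim)
        logits = v @ self.label_emb.t() / self.temperature   # (B, N, num_labels)
        return F.cross_entropy(logits.flatten(0, 1), pseudo_labels.flatten())

In a pretraining loop, a term like this would simply be added to the usual image-text objectives (e.g. image-text matching and masked language modeling) on the visual stream.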
Related papers
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with the removal of 30% of the vision tokens across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime [23.255873641249263]
We look into task adaptation in the low-data regime, and provide a study of the existing adaptation methods for generative Visual Language Models.
We show important benefits of self-labelling, i.e. using the model's own predictions to self-improve when a larger pool of unlabelled images is available.
arXiv Detail & Related papers (2023-05-03T17:42:54Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that downstream visual perception models can be adapted faster using the proposed VPD.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model; a minimal sketch of both ideas follows after this list.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
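Note: the ALBEF entry above describes two concrete techniques, an image-text contrastive loss applied to unimodal representations before fusion and momentum distillation from pseudo-targets produced by an EMA copy of the model. The Python sketch below combines both into one loss; the function name, temperature, and distillation weight alpha are illustrative assumptions rather than ALBEF's exact configuration.

import torch
import torch.nn.functional as F


def contrastive_loss_with_momentum_distillation(img_feat, txt_feat,
                                                img_feat_m, txt_feat_m,
                                                temperature=0.07, alpha=0.4):
    # img_feat / txt_feat: (B, D) features from the online encoders
    # img_feat_m / txt_feat_m: (B, D) features from the momentum (EMA) encoders
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    img_m = F.normalize(img_feat_m, dim=-1)
    txt_m = F.normalize(txt_feat_m, dim=-1)

    sim_i2t = img @ txt.t() / temperature   # (B, B) image-to-text similarities
    sim_t2i = txt @ img.t() / temperature   # (B, B) text-to-image similarities

    with torch.no_grad():
        # Soft pseudo-targets: a mixture of the one-hot alignment and the
        # momentum model's similarity distribution.
        one_hot = torch.eye(img.size(0), device=img.device)
        tgt_i2t = alpha * F.softmax(img_m @ txt_m.t() / temperature, dim=1) + (1 - alpha) * one_hot
        tgt_t2i = alpha * F.softmax(txt_m @ img_m.t() / temperature, dim=1) + (1 - alpha) * one_hot

    loss_i2t = -(F.log_softmax(sim_i2t, dim=1) * tgt_i2t).sum(dim=1).mean()
    loss_t2i = -(F.log_softmax(sim_t2i, dim=1) * tgt_t2i).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)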
This list is automatically generated from the titles and abstracts of the papers in this site.