MiniVLM: A Smaller and Faster Vision-Language Model
- URL: http://arxiv.org/abs/2012.06946v1
- Date: Sun, 13 Dec 2020 03:02:06 GMT
- Title: MiniVLM: A Smaller and Faster Vision-Language Model
- Authors: Jianfeng Wang and Xiaowei Hu and Pengchuan Zhang and Xiujun Li and
Lijuan Wang and Lei Zhang and Jianfeng Gao and Zicheng Liu
- Abstract summary: MiniVLM consists of two modules, a vision feature extractor and a vision-language fusion module.
MiniVLM reduces the model size by $73%$ and the inference time cost by $94%$.
- Score: 76.35880443015493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent vision-language (VL) studies have shown remarkable progress by
learning generic representations from massive image-text pairs with transformer
models and then fine-tuning on downstream VL tasks. While existing research has
been focused on achieving high accuracy with large pre-trained models, building
a lightweight model is of great value in practice but is less explored. In this
paper, we propose a smaller and faster VL model, MiniVLM, which can be
finetuned with good performance on various downstream tasks like its larger
counterpart. MiniVLM consists of two modules, a vision feature extractor and a
transformer-based vision-language fusion module. We design a Two-stage
Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet
network, to significantly reduce the time cost of visual feature extraction by
$95\%$, compared to a baseline model. We adopt the MiniLM structure to reduce
the computation cost of the transformer module after comparing different
compact BERT models. In addition, we improve the MiniVLM pre-training by adding
$7M$ Open Images data, which are pseudo-labeled by a state-of-the-art
captioning model. We also pre-train with high-quality image tags obtained from
a strong tagging model to enhance cross-modality alignment. The large models
are used offline without adding any overhead in fine-tuning and inference. With
the above design choices, our MiniVLM reduces the model size by $73\%$ and the
inference time cost by $94\%$ while being able to retain $94-97\%$ of the
accuracy on multiple VL tasks. We hope that MiniVLM helps ease the use of the
state-of-the-art VL research for on-the-edge applications.
Related papers
- Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model [7.082567506213992]
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model.
It is designed for efficient deployment on consumer GPU servers.
arXiv Detail & Related papers (2024-05-15T09:47:59Z) - When Do We Not Need Larger Vision Models? [55.957626371697785]
Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations.
We demonstrate the power of Scaling on Scales (S$2$), whereby a pre-trained and frozen smaller vision model can outperform larger models.
We release a Python package that can apply S$2$ on any vision model with one line of code.
arXiv Detail & Related papers (2024-03-19T17:58:39Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - DIME-FM: DIstilling Multimodal and Efficient Foundation Models [72.1900621000677]
Large Vision-Language Foundation Models (VLFM) are trained on large-scale datasets of image-caption pairs.
We introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models.
The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset.
arXiv Detail & Related papers (2023-03-31T17:47:23Z) - An Empirical Study of Training End-to-End Vision-and-Language
Transformers [50.23532518166621]
We present METER(textbfMultimodal textbfEnd-to-end textbfTransformtextbfER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion (e.g., merged attention vs. co-
arXiv Detail & Related papers (2021-11-03T17:55:36Z) - VinVL: Revisiting Visual Representations in Vision-Language Models [96.39332942534368]
We develop an improved object detection model to provide object-centric representations of images.
New visual features significantly improve the performance across all vision language (VL) tasks.
We will release the new object detection model to public.
arXiv Detail & Related papers (2021-01-02T23:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.