MiniVLM: A Smaller and Faster Vision-Language Model
- URL: http://arxiv.org/abs/2012.06946v1
- Date: Sun, 13 Dec 2020 03:02:06 GMT
- Title: MiniVLM: A Smaller and Faster Vision-Language Model
- Authors: Jianfeng Wang and Xiaowei Hu and Pengchuan Zhang and Xiujun Li and
Lijuan Wang and Lei Zhang and Jianfeng Gao and Zicheng Liu
- Abstract summary: MiniVLM consists of two modules, a vision feature extractor and a vision-language fusion module.
MiniVLM reduces the model size by 73% and the inference time cost by 94%.
- Score: 76.35880443015493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent vision-language (VL) studies have shown remarkable progress by
learning generic representations from massive image-text pairs with transformer
models and then fine-tuning on downstream VL tasks. While existing research has
been focused on achieving high accuracy with large pre-trained models, building
a lightweight model is of great value in practice but is less explored. In this
paper, we propose a smaller and faster VL model, MiniVLM, which can be
finetuned with good performance on various downstream tasks like its larger
counterpart. MiniVLM consists of two modules, a vision feature extractor and a
transformer-based vision-language fusion module. We design a Two-stage
Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet
network, to significantly reduce the time cost of visual feature extraction by
$95\%$, compared to a baseline model. We adopt the MiniLM structure to reduce
the computation cost of the transformer module after comparing different
compact BERT models. In addition, we improve the MiniVLM pre-training by adding
$7M$ Open Images data, which are pseudo-labeled by a state-of-the-art
captioning model. We also pre-train with high-quality image tags obtained from
a strong tagging model to enhance cross-modality alignment. The large models
are used offline without adding any overhead in fine-tuning and inference. With
the above design choices, our MiniVLM reduces the model size by $73\%$ and the
inference time cost by $94\%$ while being able to retain $94-97\%$ of the
accuracy on multiple VL tasks. We hope that MiniVLM helps ease the use of the
state-of-the-art VL research for on-the-edge applications.
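To make the two-module design above concrete, here is a minimal PyTorch sketch of a MiniVLM-style pipeline: a lightweight detector-style feature extractor standing in for TEE, feeding region features into a shallow transformer encoder standing in for the MiniLM-style fusion module. All class names, dimensions, and layer counts are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the two-module MiniVLM-style pipeline described above.
# Names (TwoStageFeatureExtractor, CompactFusionTransformer, MiniVLMSketch)
# and all sizes are illustrative placeholders, not the authors' released code.
import torch
import torch.nn as nn


class TwoStageFeatureExtractor(nn.Module):
    """Stands in for TEE: a light backbone that turns an image into a small
    set of region-level visual features."""

    def __init__(self, feat_dim=256, num_regions=50):
        super().__init__()
        self.backbone = nn.Sequential(                    # placeholder for an
            nn.Conv2d(3, 64, 3, stride=2, padding=1),     # EfficientDet-like backbone
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((num_regions, 1)),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, images):                 # images: (B, 3, H, W)
        x = self.backbone(images)              # (B, 64, num_regions, 1)
        x = x.squeeze(-1).transpose(1, 2)      # (B, num_regions, 64)
        return self.proj(x)                    # (B, num_regions, feat_dim)


class CompactFusionTransformer(nn.Module):
    """Stands in for the MiniLM-style fusion module: a shallow transformer
    encoder over the concatenated text-token and region-feature sequence."""

    def __init__(self, hidden=384, layers=6, heads=6, vocab=30522):
        super().__init__()
        self.txt_emb = nn.Embedding(vocab, hidden)
        self.img_proj = nn.Linear(256, hidden)  # 256 matches the extractor's feat_dim
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, token_ids, region_feats):
        seq = torch.cat([self.txt_emb(token_ids),
                         self.img_proj(region_feats)], dim=1)
        return self.encoder(seq)               # fused multimodal sequence


class MiniVLMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = TwoStageFeatureExtractor()
        self.fusion = CompactFusionTransformer()

    def forward(self, images, token_ids):
        return self.fusion(token_ids, self.vision(images))


if __name__ == "__main__":
    model = MiniVLMSketch()
    out = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 16)))
    print(out.shape)  # (2, 16 + 50, 384): text tokens followed by region features
```

The point of the two-module split is that the vision extractor and the fusion transformer can be shrunk independently, which is how the size and latency reductions quoted in the abstract are obtained; the sketch only mirrors that structure.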
Related papers
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z)
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes.
Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data.
We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
arXiv Detail & Related papers (2024-09-19T07:10:18Z)
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution [82.38677987249348]
We present the Qwen2-VL Series, which redefines the conventional predetermined-resolution approach in visual processing.
Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens.
The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos.
arXiv Detail & Related papers (2024-09-18T17:59:32Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- DIME-FM: DIstilling Multimodal and Efficient Foundation Models [72.1900621000677]
Large Vision-Language Foundation Models (VLFM) are trained on large-scale datasets of image-caption pairs.
We introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models.
The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset.
arXiv Detail & Related papers (2023-03-31T17:47:23Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z)