Related papers: MiniVLM: A Smaller and Faster Vision-Language Model

URL: http://arxiv.org/abs/2012.06946v1
Date: Sun, 13 Dec 2020 03:02:06 GMT
Title: MiniVLM: A Smaller and Faster Vision-Language Model
Authors: Jianfeng Wang and Xiaowei Hu and Pengchuan Zhang and Xiujun Li and Lijuan Wang and Lei Zhang and Jianfeng Gao and Zicheng Liu
Abstract summary: MiniVLM consists of two modules, a vision feature extractor and a vision-language fusion module. MiniVLM reduces the model size by $73\%$ and the inference time cost by $94\%$.
Score: 76.35880443015493
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has been focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great value in practice but is less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be finetuned with good performance on various downstream tasks like its larger counterpart. MiniVLM consists of two modules, a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, to significantly reduce the time cost of visual feature extraction by $95\%$, compared to a baseline model. We adopt the MiniLM structure to reduce the computation cost of the transformer module after comparing different compact BERT models. In addition, we improve the MiniVLM pre-training by adding $7M$ Open Images data, which are pseudo-labeled by a state-of-the-art captioning model. We also pre-train with high-quality image tags obtained from a strong tagging model to enhance cross-modality alignment. The large models are used offline without adding any overhead in fine-tuning and inference. With the above design choices, our MiniVLM reduces the model size by $73\%$ and the inference time cost by $94\%$ while being able to retain $94-97\%$ of the accuracy on multiple VL tasks. We hope that MiniVLM helps ease the use of the state-of-the-art VL research for on-the-edge applications.

Related papers

SmolVLM: Redefining small and efficient multimodal models [8.849350918179752]
We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance.
arXiv Detail & Related papers (2025-04-07T17:58:57Z)
FastVLM: Efficient Vision Encoding for Vision Language Models [22.41836943083826]
We introduce FastVLM, a model that achieves an optimized trade-off between latency, model size and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
arXiv Detail & Related papers (2024-12-17T20:09:55Z)
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters. We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z)
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data. We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
arXiv Detail & Related papers (2024-09-19T07:10:18Z)
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution [82.38677987249348]
We present the Qwen2-VL Series, which redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos.
arXiv Detail & Related papers (2024-09-18T17:59:32Z)
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs. We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
DIME-FM: DIstilling Multimodal and Efficient Foundation Models [72.1900621000677]
Large Vision-Language Foundation Models (VLFM) are trained on large-scale datasets of image-caption pairs. We introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models. The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset.
arXiv Detail & Related papers (2023-03-31T17:47:23Z)
An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion (e.g., merged attention vs. co-
arXiv Detail & Related papers (2021-11-03T17:55:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
