MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
- URL: http://arxiv.org/abs/2402.03766v1
- Date: Tue, 6 Feb 2024 07:16:36 GMT
- Title: MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
- Authors: Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, and Chunhua Shen
- Abstract summary: We introduce MobileVLM V2, a family of significantly improved vision language models built upon MobileVLM.
MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale.
- Score: 73.74838586081385
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We introduce MobileVLM V2, a family of significantly improved
vision language models built upon MobileVLM, which proves that a delicate orchestration of novel
architectural design, an improved training scheme tailored for mobile VLMs, and
rich high-quality dataset curation can substantially benefit VLMs' performance.
Specifically, MobileVLM V2 1.7B achieves better or on-par performance on
standard VLM benchmarks compared with much larger VLMs at the 3B scale.
Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our
models will be released at https://github.com/Meituan-AutoML/MobileVLM.
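The abstract credits much of the gain to architectural design without detailing it here. As background, the sketch below shows the generic mobile-VLM pattern the paper family follows (a vision encoder bridged to a small language model by a lightweight projector). It is an illustrative PyTorch toy, not MobileVLM V2's actual architecture; all module names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy skeleton: vision features -> lightweight projector -> LLM token space."""
    def __init__(self, vis_dim=1024, llm_dim=2048, vocab_size=32000):
        super().__init__()
        # Lightweight projector bridging vision features to the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.token_emb = nn.Embedding(vocab_size, llm_dim)

    def forward(self, vis_feats, text_ids):
        # vis_feats: (B, num_patches, vis_dim), e.g. from a CLIP-style encoder.
        # text_ids:  (B, seq_len) token ids for the language model.
        vis_tokens = self.projector(vis_feats)   # (B, P, llm_dim)
        txt_tokens = self.token_emb(text_ids)    # (B, T, llm_dim)
        # The concatenated sequence would be fed to the LLM decoder.
        return torch.cat([vis_tokens, txt_tokens], dim=1)

model = TinyVLM()
fused = model(torch.randn(1, 196, 1024), torch.randint(0, 32000, (1, 16)))
print(fused.shape)  # torch.Size([1, 212, 2048])
```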
Related papers
- Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models [39.706833232931245]
Foundation Vision Language Models (VLMs) exhibit strong capabilities in multi-modal representation learning, comprehension, and reasoning.
By injecting action components into the VLMs, Vision-Language-Action Models (VLAs) can be naturally formed and also show promising performance.
In this work, we identify the key factors that significantly influence VLA performance and focus on three essential design choices.
We develop a new family of VLAs, RoboVLMs, which require minimal manual design and achieve new state-of-the-art performance in three simulation tasks and real-world experiments.
arXiv Detail & Related papers (2024-12-18T17:07:20Z)
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.
We propose OLA-VLM, the first approach to distill knowledge into the LLM's hidden representations from a set of target visual representations.
We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
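The summary describes distilling target visual representations into the LLM's hidden states. Below is a minimal sketch of one plausible form of such an auxiliary embedding-distillation loss; it is an assumption for illustration, not OLA-VLM's published objective, and `proj`, `hidden`, and `target` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def embed_distill_loss(hidden, target, proj):
    """hidden: (B, D_llm) LLM hidden state; target: (B, D_vis) visual embedding."""
    pred = proj(hidden)  # map the hidden state into the visual target space
    # Cosine-based distillation: 1 - cosine similarity, averaged over the batch.
    return (1 - F.cosine_similarity(pred, target, dim=-1)).mean()

proj = torch.nn.Linear(2048, 1024)
loss = embed_distill_loss(torch.randn(4, 2048), torch.randn(4, 1024), proj)
loss.backward()  # would be added to the usual next-token loss during training
```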
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
- VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models [63.27511432647797]
We propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes.
We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V.
arXiv Detail & Related papers (2024-12-02T18:58:25Z)
- BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices [35.44626025003408]
We present BlueLM-V-3B, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms.
BlueLM-V-3B boasts the following key highlights: Small Size: BlueLM-V-3B features a language model with 2.7B parameters and a vision encoder with 400M parameters.
Fast Speed: BlueLM-V-3B achieves a generation speed of 24.4 tokens/s on the MediaTek Dimensity 9300 processor with 4-bit LLM weight quantization.
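As a general illustration of the 4-bit weight quantization mentioned above, the sketch below implements group-wise symmetric int4 quantization. BlueLM-V-3B's actual scheme is not specified here, so treat this as an assumption; `group_size` is a placeholder.

```python
import numpy as np

def quantize_int4(w, group_size=64):
    """Quantize a 1-D weight vector to signed 4-bit values, one scale per group."""
    w = w.reshape(-1, group_size)
    # Symmetric scaling into the int4 range [-8, 7]; eps guards all-zero groups.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_int4(w)
print(f"mean abs quantization error: {np.abs(w - dequantize(q, s)).mean():.4f}")
```

Per-group scales keep the quantization error bounded by the largest weight in each small group rather than in the whole tensor, which is why group-wise schemes are the common choice at 4-bit precision.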
arXiv Detail & Related papers (2024-11-16T00:14:51Z)
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B that achieve 90% of the performance of larger counterparts with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z) - VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We study the potential for building universal embeddings capable of handling a wide range of downstream tasks.
We build a series of VLM2Vec models on state-of-the-art VLMs such as Phi-3.5-V and LLaVA-1.6, and evaluate them on MMEB's evaluation split.
Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models.
arXiv Detail & Related papers (2024-10-07T16:14:05Z) - MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile
Devices [73.46317110474064]
MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices.
It comprises a set of language models at the 1.4B and 2.7B parameter scales, trained from scratch, and a multimodal vision model pre-trained in the CLIP fashion.
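For context on "pre-trained in the CLIP fashion", here is a minimal sketch of the standard symmetric contrastive loss CLIP uses to align matched image/text pairs within a batch. The embedding dimension, batch size, and temperature are placeholders, not MobileVLM's settings.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # (B, B) pairwise similarity matrix
    labels = torch.arange(logits.size(0))  # diagonal entries are matched pairs
    # Symmetric cross-entropy over image->text and text->image directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```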
arXiv Detail & Related papers (2023-12-28T08:21:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.