MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
- URL: http://arxiv.org/abs/2402.03766v1
- Date: Tue, 6 Feb 2024 07:16:36 GMT
- Authors: Xiangxiang Chu and Limeng Qiao and Xinyu Zhang and Shuang Xu and Fei
Wei and Yang Yang and Xiaofei Sun and Yiming Hu and Xinyang Lin and Bo Zhang
and Chunhua Shen
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We introduce MobileVLM V2, a family of vision language models that
significantly improves upon MobileVLM. It demonstrates that a careful
orchestration of novel architectural design, a training scheme tailored for
mobile VLMs, and rich, high-quality dataset curation can substantially improve
VLM performance. Specifically, MobileVLM V2 1.7B achieves performance on par
with or better than much larger VLMs at the 3B scale on standard VLM benchmarks.
Notably, our 3B model outperforms a wide variety of VLMs at the 7B+ scale. Our
models will be released at https://github.com/Meituan-AutoML/MobileVLM .
Related papers
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
(arXiv, 2024-10-21)
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes.
Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data.
We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
(arXiv, 2024-09-19)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
(arXiv, 2024-09-17)
- Are Bigger Encoders Always Better in Vision Large Models? [21.797332686137203]
Multimodal large language models (MLLMs) have shown strong potential in real-world applications.
The scaling trend of vision language models (VLMs) under the current mainstream paradigm has not been extensively studied.
We conduct experiments on the pretraining stage of MLLMs using different encoder sizes and large language model (LLM) sizes.
(arXiv, 2024-08-01)
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models [55.267193180769794]
Mini-Gemini is a framework for enhancing multi-modality Vision Language Models (VLMs).
Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B.
It is demonstrated to achieve leading performance on several zero-shot benchmarks, even surpassing well-developed private models.
(arXiv, 2024-03-27)
- TinyLLaVA: A Framework of Small-scale Large Multimodal Models [11.686023770810937]
We study the effects of different vision encoders, connection modules, language models, training data and training recipes.
Under our framework, we train a family of small-scale LMMs. Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL.
(arXiv, 2024-02-22)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
(arXiv, 2024-02-12)
- MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices [73.46317110474064]
MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices.
It comprises a set of language models at the 1.4B and 2.7B parameter scales trained from scratch, and a multimodal vision model pre-trained in the CLIP fashion.
(arXiv, 2023-12-28)
This list is automatically generated from the titles and abstracts of the papers on this site.