MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
- URL: http://arxiv.org/abs/2312.16886v2
- Date: Sat, 30 Dec 2023 04:59:21 GMT
- Title: MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
- Authors: Xiangxiang Chu and Limeng Qiao and Xinyang Lin and Shuang Xu and Yang
Yang and Yiming Hu and Fei Wei and Xinyu Zhang and Bo Zhang and Xiaolin Wei
and Chunhua Shen
- Abstract summary: MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices.
It comprises a set of language models at the 1.4B and 2.7B parameter scales, trained from scratch, and a multimodal vision model pre-trained in the CLIP fashion.
- Score: 73.46317110474064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MobileVLM, a competent multimodal vision language model (MMVLM)
targeted to run on mobile devices. It is an amalgamation of mobile-oriented architectural
designs and techniques, comprising a set of language models at the 1.4B and 2.7B parameter
scales trained from scratch, a multimodal vision model pre-trained in the CLIP fashion,
and cross-modality interaction via an efficient projector. We evaluate MobileVLM on
several typical VLM benchmarks. Our models perform on par with a few much larger models.
More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and
an NVIDIA Jetson Orin GPU, obtaining state-of-the-art speeds of 21.5 tokens per second and
65.3 tokens per second, respectively. Our code will be made available at:
https://github.com/Meituan-AutoML/MobileVLM.
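Based on the abstract's description of the pipeline (a CLIP-style vision encoder, an efficient projector, and a compact language model), the following is a minimal sketch of how the three parts might compose. The module names (EfficientProjector, ToyMobileVLM), the two-layer MLP projector, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a MobileVLM-style composition: vision encoder -> projector -> LLM.
# All sizes and module designs below are assumptions for illustration only.
import torch
import torch.nn as nn


class EfficientProjector(nn.Module):
    """Hypothetical two-layer MLP mapping vision features to the LLM hidden size."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(image_feats)


class ToyMobileVLM(nn.Module):
    """Glue module: vision features -> projector -> prepend to text embeddings -> LLM."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm

    def forward(self, image_patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        image_feats = self.vision_encoder(image_patches)       # (B, N_img, vision_dim)
        image_tokens = self.projector(image_feats)             # (B, N_img, llm_dim)
        fused = torch.cat([image_tokens, text_embeds], dim=1)  # multimodal token sequence
        return self.llm(fused)                                 # (B, N_img + N_txt, llm_dim)


if __name__ == "__main__":
    B, n_img, n_txt = 2, 144, 32
    patch_dim, vision_dim, llm_dim = 588, 1024, 2048

    vision_encoder = nn.Linear(patch_dim, vision_dim)  # stand-in for a CLIP-style ViT
    projector = EfficientProjector(vision_dim, llm_dim)
    llm = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)  # stand-in LLM block

    model = ToyMobileVLM(vision_encoder, projector, llm)
    patches = torch.randn(B, n_img, patch_dim)
    text_embeds = torch.randn(B, n_txt, llm_dim)
    print(model(patches, text_embeds).shape)  # torch.Size([2, 176, 2048])
```

For context on the reported speeds, at 21.5 tokens per second on the Snapdragon 888 CPU a 256-token response would take roughly 12 seconds, while 65.3 tokens per second on the Jetson Orin GPU brings that down to about 4 seconds.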
Related papers
- H2OVL-Mississippi Vision Language Models Technical Report [4.070560738863018]
We present H2OVL-Mississippi, a pair of small vision-language models trained on 37 million image-text pairs.
H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition.
We are releasing H2OVL-Mississippi-2B, a 2 billion parameter model for general use cases, exhibiting highly competitive metrics.
arXiv Detail & Related papers (2024-10-17T14:46:34Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- CogVLM2: Visual Language Models for Image and Video Understanding [69.361109860391]
We propose the CogVLM2 family, a new generation of visual language models for image and video understanding.
As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages.
As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction.
arXiv Detail & Related papers (2024-08-29T12:59:12Z)
- Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model [7.082567506213992]
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model.
It is designed for efficient deployment on consumer GPU servers.
arXiv Detail & Related papers (2024-05-15T09:47:59Z)
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model [73.74838586081385]
We introduce MobileVLM V2, a family of significantly improved vision language models upon MobileVLM.
MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale.
arXiv Detail & Related papers (2024-02-06T07:16:36Z)
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
- Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale down Vision Transformers (ViTs), making them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images, rather than individual patches, are routed to the experts (a toy sketch of this routing appears after this list).
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
arXiv Detail & Related papers (2023-09-08T14:24:10Z)
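Taking the Mobile V-MoEs summary above at face value, here is a minimal sketch of what image-level (rather than patch-level) expert routing could look like. The expert count, top-k value, pooling choice, and module names are my own illustrative assumptions, not the paper's design.

```python
# Hedged sketch: a sparse MoE layer that routes whole images, not individual patches.
import torch
import torch.nn as nn


class ImageLevelMoE(nn.Module):
    """Routes each image to a small number of experts using one pooled routing decision."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 1):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # scores experts from a pooled image feature
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim); the routing decision is made once per image,
        # using the mean-pooled token sequence as the routing signal.
        pooled = tokens.mean(dim=1)                          # (B, dim)
        weights = self.router(pooled).softmax(dim=-1)        # (B, num_experts)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # (B, top_k)
        out = torch.zeros_like(tokens)
        for b in range(tokens.size(0)):
            for w, e in zip(topk_w[b], topk_idx[b]):
                out[b] += w * self.experts[int(e)](tokens[b])  # whole image through expert e
        return out


if __name__ == "__main__":
    layer = ImageLevelMoE(dim=256)
    x = torch.randn(2, 196, 256)  # two images, 196 patch tokens each
    print(layer(x).shape)         # torch.Size([2, 196, 256])
```

Routing once per image means the dispatch decision is made a single time per forward pass rather than once per token, which is one plausible reading of why the summary calls the design simplified and mobile-friendly.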