MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
- URL: http://arxiv.org/abs/2312.16886v2
- Date: Sat, 30 Dec 2023 04:59:21 GMT
- Title: MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
- Authors: Xiangxiang Chu and Limeng Qiao and Xinyang Lin and Shuang Xu and Yang
Yang and Yiming Hu and Fei Wei and Xinyu Zhang and Bo Zhang and Xiaolin Wei
and Chunhua Shen
- Abstract summary: MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices.
It comprises a set of language models at the 1.4B and 2.7B parameter scales, trained from scratch, and a multimodal vision model pre-trained in the CLIP fashion.
- Score: 73.46317110474064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MobileVLM, a competent multimodal vision language model (MMVLM)
targeted to run on mobile devices. It is an amalgamation of mobile-oriented architectural
designs and techniques, comprising a set of language models at the 1.4B and 2.7B parameter
scales trained from scratch, a multimodal vision model pre-trained in the CLIP fashion,
and cross-modality interaction via an efficient projector. We evaluate MobileVLM on
several typical VLM benchmarks. Our models perform on par with a few much larger models.
More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and
an NVIDIA Jetson Orin GPU, obtaining state-of-the-art speeds of 21.5 tokens per second and
65.3 tokens per second, respectively. Our code will be made available at:
https://github.com/Meituan-AutoML/MobileVLM.
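Based on the abstract's description of the pipeline (a CLIP-style vision encoder, an efficient projector, and a compact language model), the following is a minimal sketch of how the three parts might compose. The module names (EfficientProjector, ToyMobileVLM), the two-layer MLP projector, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a MobileVLM-style composition: vision encoder -> projector -> LLM.
# All sizes and module designs below are assumptions for illustration only.
import torch
import torch.nn as nn


class EfficientProjector(nn.Module):
    """Hypothetical two-layer MLP mapping vision features to the LLM hidden size."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(image_feats)


class ToyMobileVLM(nn.Module):
    """Glue module: vision features -> projector -> prepend to text embeddings -> LLM."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm

    def forward(self, image_patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        image_feats = self.vision_encoder(image_patches)       # (B, N_img, vision_dim)
        image_tokens = self.projector(image_feats)             # (B, N_img, llm_dim)
        fused = torch.cat([image_tokens, text_embeds], dim=1)  # multimodal token sequence
        return self.llm(fused)                                 # (B, N_img + N_txt, llm_dim)


if __name__ == "__main__":
    B, n_img, n_txt = 2, 144, 32
    patch_dim, vision_dim, llm_dim = 588, 1024, 2048

    vision_encoder = nn.Linear(patch_dim, vision_dim)  # stand-in for a CLIP-style ViT
    projector = EfficientProjector(vision_dim, llm_dim)
    llm = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)  # stand-in LLM block

    model = ToyMobileVLM(vision_encoder, projector, llm)
    patches = torch.randn(B, n_img, patch_dim)
    text_embeds = torch.randn(B, n_txt, llm_dim)
    print(model(patches, text_embeds).shape)  # torch.Size([2, 176, 2048])
```

For context on the reported speeds, at 21.5 tokens per second on the Snapdragon 888 CPU a 256-token response would take roughly 12 seconds, while 65.3 tokens per second on the Jetson Orin GPU brings that down to about 4 seconds.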
Related papers
- H2OVL-Mississippi Vision Language Models Technical Report [4.070560738863018]
We present H2OVL-Mississippi, a pair of small vision-language models trained on 37 million image-text pairs.
H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition.
We are releasing H2OVL-Mississippi-2B, a 2 billion parameter model for general use cases, exhibiting highly competitive metrics.
arXiv Detail & Related papers (2024-10-17T14:46:34Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- CogVLM2: Visual Language Models for Image and Video Understanding [69.361109860391]
We propose the CogVLM2 family, a new generation of visual language models for image and video understanding.
As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages.
As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction.
arXiv Detail & Related papers (2024-08-29T12:59:12Z)
- Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model [7.082567506213992]
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model.
It is designed for efficient deployment on consumer GPU servers.
arXiv Detail & Related papers (2024-05-15T09:47:59Z)
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model [73.74838586081385]
We introduce MobileVLM V2, a family of significantly improved vision language models upon MobileVLM.
MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale.
arXiv Detail & Related papers (2024-02-06T07:16:36Z)
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
- Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale down Vision Transformers (ViTs), making them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images, rather than individual patches, are routed to the experts (a toy sketch of this routing appears after this list).
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
arXiv Detail & Related papers (2023-09-08T14:24:10Z)
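Taking the Mobile V-MoEs summary above at face value, here is a minimal sketch of what image-level (rather than patch-level) expert routing could look like. The expert count, top-k value, pooling choice, and module names are my own illustrative assumptions, not the paper's design.

```python
# Hedged sketch: a sparse MoE layer that routes whole images, not individual patches.
import torch
import torch.nn as nn


class ImageLevelMoE(nn.Module):
    """Routes each image to a small number of experts using one pooled routing decision."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 1):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # scores experts from a pooled image feature
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim); the routing decision is made once per image,
        # using the mean-pooled token sequence as the routing signal.
        pooled = tokens.mean(dim=1)                          # (B, dim)
        weights = self.router(pooled).softmax(dim=-1)        # (B, num_experts)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # (B, top_k)
        out = torch.zeros_like(tokens)
        for b in range(tokens.size(0)):
            for w, e in zip(topk_w[b], topk_idx[b]):
                out[b] += w * self.experts[int(e)](tokens[b])  # whole image through expert e
        return out


if __name__ == "__main__":
    layer = ImageLevelMoE(dim=256)
    x = torch.randn(2, 196, 256)  # two images, 196 patch tokens each
    print(layer(x).shape)         # torch.Size([2, 196, 256])
```

Routing once per image means the dispatch decision is made a single time per forward pass rather than once per token, which is one plausible reading of why the summary calls the design simplified and mobile-friendly.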