Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
- URL: http://arxiv.org/abs/2405.09215v3
- Date: Thu, 20 Jun 2024 07:31:13 GMT
- Title: Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
- Authors: Wanting Xu, Yang Liu, Langping He, Xucheng Huang, Ling Jiang
- Abstract summary: We introduce Xmodel-VLM, a cutting-edge multimodal vision language model.
It is designed for efficient deployment on consumer GPU servers.
- Score: 7.082567506213992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue by grappling with the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
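The LLaVA paradigm referenced in the abstract aligns a pre-trained vision encoder with the language model by projecting visual patch features into the LM's token-embedding space. Below is a minimal, hypothetical PyTorch sketch of that alignment step; the module names, feature dimensions, and two-layer MLP projector are illustrative assumptions, not the authors' released implementation (see the GitHub repository above for the actual code).

```python
# Minimal sketch of LLaVA-style modal alignment (hypothetical names and shapes).
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Projects vision-encoder patch features into the LM embedding space."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        # A small MLP projector, as popularized by the LLaVA family of models.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim), e.g. from a CLIP ViT
        return self.proj(vision_feats)  # -> (batch, num_patches, lm_dim)

def build_multimodal_inputs(text_embeds: torch.Tensor,
                            image_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend projected image tokens to the text token embeddings so that a
    compact (~1B-parameter) language model can attend over both."""
    return torch.cat([image_embeds, text_embeds], dim=1)

# Dummy shapes for illustration only: 576 image patches with 1024-dim features
# and a 2048-dim LM hidden size are assumptions, not Xmodel-VLM's actual sizes.
connector = VisionLanguageConnector(vision_dim=1024, lm_dim=2048)
image_embeds = connector(torch.randn(1, 576, 1024))
text_embeds = torch.randn(1, 32, 2048)
inputs_embeds = build_multimodal_inputs(text_embeds, image_embeds)  # (1, 608, 2048)
```

In this scheme only the lightweight projector (and later the LM) needs training, which is what keeps the alignment stage cheap enough for consumer GPU servers.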
Related papers
- Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts [37.81475180129456]
We introduce the innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve).
By strategically incorporating visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities.
Eve distinctly outperforms on language benchmarks and achieves state-of-the-art results of 68.87% on VLM benchmarks.
arXiv Detail & Related papers (2025-01-08T07:42:54Z) - Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0.
InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet.
We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z) - Liquid: Language Models are Scalable Multi-modal Generators [112.71734051183726]
Liquid is an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation.
Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using a single large language model.
For the first time, Liquid uncovers a scaling law showing that the performance drop unavoidably brought by the unified training of visual and language tasks diminishes as model size increases.
arXiv Detail & Related papers (2024-12-05T16:48:16Z) - Xmodel-LM Technical Report [13.451816134545163]
Xmodel-LM is a compact and efficient 1.1B language model pre-trained on around 2 trillion tokens.
It exhibits remarkable performance despite its smaller size.
arXiv Detail & Related papers (2024-06-05T02:12:06Z) - Libra: Building Decoupled Vision System on Large Language Models [63.28088885230901]
We introduce Libra, a prototype model with a decoupled vision system on a large language model (LLM).
The decoupled vision system decouples inner-modal modeling and cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension.
arXiv Detail & Related papers (2024-05-16T14:34:44Z) - VL-Mamba: Exploring State Space Models for Multimodal Learning [22.701028299912398]
In this work, we propose VL-Mamba, a multimodal large language model based on state space models.
Specifically, we first replace the transformer-based backbone language model, such as LLaMA or Vicuna, with the pre-trained Mamba language model (a schematic sketch of this kind of backbone swap appears after this list).
arXiv Detail & Related papers (2024-03-20T13:48:50Z) - MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices [73.46317110474064]
MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices.
It comprises a set of language models at the 1.4B and 2.7B parameter scales, trained from scratch, and a multimodal vision model pre-trained in the CLIP fashion.
arXiv Detail & Related papers (2023-12-28T08:21:24Z) - InternVL: Scaling up Vision Foundation Models and Aligning for Generic
Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z) - UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z) - X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks [38.05496300873095]
Vision language pre-training aims to learn alignments between vision and language from a large amount of data.
We propose to learn multi-grained vision language alignments by a unified pre-training framework.
X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions.
arXiv Detail & Related papers (2022-11-22T16:48:01Z)
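As referenced in the VL-Mamba entry above, one concrete design choice in these efficient VLMs is which language backbone sits behind the vision projector. The snippet below is a schematic, hypothetical sketch of such a backbone swap using publicly available Hugging Face checkpoints; the checkpoint names and loading details are assumptions and do not reproduce the VL-Mamba training pipeline.

```python
# Hypothetical sketch: keep a LLaVA-style vision projector, but load a pre-trained
# Mamba state-space language model instead of a transformer LM such as LLaMA/Vicuna.
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_language_backbone(use_mamba: bool = True):
    # Checkpoint names are assumptions chosen for illustration; the surrounding
    # multimodal projector and training recipe can stay largely unchanged.
    name = "state-spaces/mamba-1.4b-hf" if use_mamba else "lmsys/vicuna-7b-v1.5"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    return tokenizer, model

tokenizer, lm = load_language_backbone(use_mamba=True)
print(type(lm).__name__)  # e.g. MambaForCausalLM when the Mamba checkpoint is used
```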