Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
- URL: http://arxiv.org/abs/2405.09215v3
- Date: Thu, 20 Jun 2024 07:31:13 GMT
- Title: Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
- Authors: Wanting Xu, Yang Liu, Langping He, Xucheng Huang, Ling Jiang
- Abstract summary: We introduce Xmodel-VLM, a cutting-edge multimodal vision language model.
It is designed for efficient deployment on consumer GPU servers.
- Score: 7.082567506213992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue by grappling with the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
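The abstract only names the LLaVA paradigm for modal alignment; concretely, that paradigm attaches a frozen CLIP-style vision encoder to a language model through a small trainable projector whose outputs are prepended to the text embeddings. Below is a minimal PyTorch sketch of that idea; the dimensions, the two-layer MLP projector, and all names are illustrative assumptions rather than the released Xmodel-VLM code.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """LLaVA-style projector: map frozen vision features into the LM embedding space.

    Hypothetical dimensions -- the real Xmodel-VLM projector may differ.
    """
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen CLIP-style encoder
        return self.proj(patch_features)  # (batch, num_patches, lm_dim)

# Usage: prepend projected image tokens to the text embeddings of a ~1B-parameter LM,
# which then processes the concatenated sequence as ordinary tokens.
connector = VisionLanguageConnector()
image_feats = torch.randn(2, 576, 1024)   # e.g. 24x24 patches from the vision tower
text_embeds = torch.randn(2, 32, 2048)    # embedded prompt tokens
lm_inputs = torch.cat([connector(image_feats), text_embeds], dim=1)
print(lm_inputs.shape)  # torch.Size([2, 608, 2048])
```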
Related papers
- Xmodel-LM Technical Report [13.451816134545163]
Xmodel-LM is a compact and efficient 1.1B language model pre-trained on around 2 trillion tokens.
It exhibits remarkable performance despite its smaller size.
arXiv Detail & Related papers (2024-06-05T02:12:06Z)
- Libra: Building Decoupled Vision System on Large Language Models [63.28088885230901]
We introduce Libra, a prototype model with a decoupled vision system built on top of a large language model (LLM).
The decoupled vision system separates inner-modal modeling from cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension.
arXiv Detail & Related papers (2024-05-16T14:34:44Z) - VL-Mamba: Exploring State Space Models for Multimodal Learning [22.701028299912398]
In this work, we propose VL-Mamba, a multimodal large language model based on state space models.
Specifically, we first replace the transformer-based backbone language model, such as LLaMA or Vicuna, with the pre-trained Mamba language model.
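The swap described above is architectural rather than algorithmic: the vision encoder and projector of a LLaVA-style pipeline stay in place, and only the sequence backbone changes from a transformer to a state space model. A hedged sketch of loading such a backbone with Hugging Face Transformers follows; the checkpoint id and the surrounding wiring are assumptions, not the VL-Mamba release.

```python
# Hedged sketch: load a pre-trained Mamba LM in place of a transformer LM.
# "state-spaces/mamba-2.8b-hf" is an assumed HF-format checkpoint; requires a
# recent transformers release with Mamba support.
from transformers import AutoModelForCausalLM, AutoTokenizer

backbone_id = "state-spaces/mamba-2.8b-hf"
tokenizer = AutoTokenizer.from_pretrained(backbone_id)
language_model = AutoModelForCausalLM.from_pretrained(backbone_id)

# The vision encoder and projector would feed their projected image tokens into
# this backbone exactly as in a transformer-based pipeline; only the sequence
# model that consumes them changes from attention to a state space model.
input_ids = tokenizer("Describe the image.", return_tensors="pt").input_ids
print(language_model(input_ids=input_ids).logits.shape)
```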
arXiv Detail & Related papers (2024-03-20T13:48:50Z)
- When Do We Not Need Larger Vision Models? [55.957626371697785]
Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations.
We demonstrate the power of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision model can outperform larger models.
We release a Python package that can apply S$^2$ to any vision model with one line of code.
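The summary above leaves S$^2$ abstract; the core idea is to feed the same frozen, smaller vision model the image at several resolutions and combine the resulting features, instead of moving to a larger model. A rough PyTorch sketch of that idea follows; the resizing, pooling, and function names are assumptions, not the released S$^2$ package API.

```python
import torch
import torch.nn.functional as F

def s2_features(vision_model, image: torch.Tensor, scales=(1.0, 2.0)) -> torch.Tensor:
    """Rough sketch of the Scaling-on-Scales (S^2) idea: run a frozen model on the
    image at multiple resolutions and concatenate the pooled features.
    `vision_model` is assumed to map (B, 3, H, W) -> (B, D); the real S^2 package
    works on patch features and splits large scales into crops."""
    feats = []
    base_h, base_w = image.shape[-2:]
    with torch.no_grad():
        for s in scales:
            resized = F.interpolate(
                image, size=(int(base_h * s), int(base_w * s)),
                mode="bilinear", align_corners=False,
            )
            feats.append(vision_model(resized))
    return torch.cat(feats, dim=-1)  # (B, D * len(scales))

# Toy usage with a stand-in "vision model" (conv stem + global average pooling).
dummy = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, 2, 1),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
)
print(s2_features(dummy, torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 128])
```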
arXiv Detail & Related papers (2024-03-19T17:58:39Z)
- MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices [73.46317110474064]
MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices.
It comprises a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch, and a multimodal vision model pre-trained in the CLIP fashion.
arXiv Detail & Related papers (2023-12-28T08:21:24Z)
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment language models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
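Concretely, the recipe above keeps both the vision encoder and the language model frozen and trains only a linear projection plus one prepended soft token that inject the perceptual signal. A hedged PyTorch sketch of that setup follows; the injection point, dimensions, and names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PerceptualPrompt(nn.Module):
    """Sketch of an eP-ALM-style adapter: everything frozen except one linear
    projection and one learnable soft token (illustrative, not the paper's code)."""
    def __init__(self, vision_dim: int = 768, lm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)                   # trainable
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))   # trainable

    def forward(self, visual_cls: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # visual_cls: (B, vision_dim) CLS feature from a frozen vision encoder
        # text_embeds: (B, T, lm_dim) embeddings from a frozen language model
        b = text_embeds.size(0)
        vis = self.proj(visual_cls).unsqueeze(1)           # (B, 1, lm_dim)
        tok = self.soft_token.expand(b, -1, -1)            # (B, 1, lm_dim)
        return torch.cat([tok, vis, text_embeds], dim=1)   # prepend to the LM input

# Trainable parameters are only the projection and the soft token, i.e. well under
# 1% of the full vision + language model.
adapter = PerceptualPrompt()
out = adapter(torch.randn(4, 768), torch.randn(4, 16, 1024))
print(out.shape)  # torch.Size([4, 18, 1024])
```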
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks [38.05496300873095]
Vision language pre-training aims to learn alignments between vision and language from a large amount of data.
We propose to learn multi-grained vision language alignments via a unified pre-training framework.
X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions.
arXiv Detail & Related papers (2022-11-22T16:48:01Z)