VITA: Towards Open-Source Interactive Omni Multimodal LLM
- URL: http://arxiv.org/abs/2408.05211v2
- Date: Tue, 10 Sep 2024 13:21:08 GMT
- Title: VITA: Towards Open-Source Interactive Omni Multimodal LLM
- Authors: Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Shaoqi Dong, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, Ran He, Rongrong Ji, Yunsheng Wu, Caifeng Shan, Xing Sun
- Abstract summary: We introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at the simultaneous processing and analysis of Video, Image, Text, and Audio modalities.
We endow the language model with visual and audio capabilities through two-stage multi-task learning.
VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding.
- Score: 104.52782565106033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary and then perform bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While much work remains for VITA to approach its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: https://vita-home.github.io.
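To make the two-stage recipe in the abstract concrete, below is a minimal PyTorch-style sketch, not the actual VITA implementation: modality encoders feed lightweight projectors into the language model, the projectors are trained first for alignment, and the full model is then instruction-tuned. All class, loader, and hyperparameter names (OmniMLLM, alignment_loader, instruction_loader, learning rates, step counts) are illustrative assumptions.

```python
# Minimal sketch of a two-stage multimodal training recipe (alignment, then
# instruction tuning). Illustrative only; names and hyperparameters are assumptions.
import torch
from torch import nn

class OmniMLLM(nn.Module):
    def __init__(self, llm, vision_encoder, audio_encoder, d_model):
        super().__init__()
        self.llm = llm                          # pretrained decoder LLM (assumed HuggingFace-style: returns .loss when given labels)
        self.vision_encoder = vision_encoder    # frozen visual backbone producing (B, Nv, Dv) features
        self.audio_encoder = audio_encoder      # frozen audio backbone producing (B, Na, Da) features
        self.vision_proj = nn.Linear(vision_encoder.out_dim, d_model)  # maps visual features into the LLM embedding space
        self.audio_proj = nn.Linear(audio_encoder.out_dim, d_model)    # maps audio features into the LLM embedding space

    def forward(self, text_embeds, labels, image=None, audio=None):
        parts = []
        if image is not None:
            parts.append(self.vision_proj(self.vision_encoder(image)))
        if audio is not None:
            parts.append(self.audio_proj(self.audio_encoder(audio)))
        parts.append(text_embeds)
        inputs = torch.cat(parts, dim=1)        # prepend modality tokens to the text sequence
        # labels are assumed to be set to -100 (ignored) at the prepended modality positions
        return self.llm(inputs_embeds=inputs, labels=labels)

def train_stage(model, loader, trainable_params, lr, steps):
    """One training stage: plain next-token prediction on the supplied data."""
    opt = torch.optim.AdamW(trainable_params, lr=lr)
    for _, batch in zip(range(steps), loader):
        loss = model(**batch).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

# Stage 1 (multimodal alignment): train only the projectors on caption-style data.
# Stage 2 (multimodal instruction tuning): unfreeze the LLM and tune on instruction data.
# `alignment_loader` and `instruction_loader` are assumed dataloaders, not part of the paper.
# projector_params = list(model.vision_proj.parameters()) + list(model.audio_proj.parameters())
# train_stage(model, alignment_loader, projector_params, lr=1e-3, steps=10_000)
# train_stage(model, instruction_loader, model.parameters(), lr=2e-5, steps=5_000)
```

Training only the projectors first is the usual rationale for such an alignment stage: it is cheap and leaves the base model's language ability untouched before the more expensive instruction-tuning stage.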
Related papers
- Ocean-omni: To Understand the World with Omni-modality [28.306965534325904]
We introduce Ocean-omni, the first open-source 7B Multimodal Large Language Model (MLLM)
arXiv Detail & Related papers (2024-10-11T06:44:31Z)
- Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond [51.141270065306514]
This tutorial aims to equip researchers, practitioners, and newcomers with the knowledge and skills to leverage multimodal AI.
We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language.
Hands-on laboratories will offer practical experience with state-of-the-art multimodal models.
arXiv Detail & Related papers (2024-10-08T01:41:56Z)
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens.
MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [74.31268379055201]
mPLUG-Owl2 is a versatile multi-modal large language model.
It effectively leverages modality collaboration to improve performance in both text and multi-modal tasks.
arXiv Detail & Related papers (2023-11-07T14:21:29Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst [24.517389691825667]
ChatBridge is a novel multimodal language model that leverages the expressive capabilities of language to bridge the gap between various modalities.
All codes, data, and models of ChatBridge will be open-sourced.
arXiv Detail & Related papers (2023-05-25T14:34:08Z)
- Large-scale Bilingual Language-Image Contrastive Learning [17.19890778916312]
We collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP.
We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation.
Experiments demonstrate that a model trained with such training schemes shows competitive performance in both languages.
arXiv Detail & Related papers (2022-03-28T03:02:03Z)
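As context for the bilingual contrastive training mentioned in the KELIP entry above, the sketch below shows the standard CLIP-style symmetric contrastive objective such image-text models typically optimize. It is illustrative only, not the authors' implementation, and the encoder names in the usage comment are hypothetical.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss, as typically used for
# bilingual image-text training. Illustrative only; encoders in the usage comment are placeholders.
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (image, text) pairs.

    `text_features` may come from Korean or English captions; the objective
    is identical in both languages.
    """
    # L2-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    logits = image_features @ text_features.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)   # matching pairs lie on the diagonal

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Usage (with hypothetical encoders):
# img_feat = vision_encoder(images)    # (B, D)
# txt_feat = text_encoder(captions)    # (B, D), captions in Korean or English
# loss = contrastive_loss(img_feat, txt_feat)
```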
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.