Related papers: Baichuan-Omni Technical Report

URL: http://arxiv.org/abs/2410.08565v4
Date: Fri, 27 Dec 2024 14:19:55 GMT
Title: Baichuan-Omni Technical Report
Authors: Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, Zenan Zhou, Weipeng Chen
Abstract summary: We introduce Baichuan-omni, the first open-source 7B Multimodal Large Language Model (MLLM). Baichuan-omni is adept at concurrently processing and analyzing modalities of image, video, audio, and text.
Score: 28.306965534325904
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with a 7B model and proceeding through two stages of multimodal alignment and multitask fine-tuning across the audio, image, video, and text modalities. This approach equips the language model with the ability to handle visual and audio data effectively. Demonstrating strong performance across various omni-modal and multimodal benchmarks, we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.

Related papers

InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue [35.99134148462425]
We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding.
arXiv Detail & Related papers (2025-10-15T16:52:48Z)
Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond [51.141270065306514]
This tutorial aims to equip researchers, practitioners, and newcomers with the knowledge and skills to leverage multimodal AI. We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language. Hands-on laboratories will offer practical experience with state-of-the-art multimodal models.
arXiv Detail & Related papers (2024-10-08T01:41:56Z)
MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens. MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
VITA: Towards Open-Source Interactive Omni Multimodal LLM [104.52782565106033]
We introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM). We endow the language model with visual and audio capabilities through two-stage multi-task learning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding.
arXiv Detail & Related papers (2024-08-09T17:59:49Z)
S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results. The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector.
arXiv Detail & Related papers (2024-06-26T12:45:43Z)
POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models [28.072184039405784]
We present POEM, a visual analytics system to facilitate efficient prompt engineering for large language models (LLMs). The system enables users to explore the interaction patterns across modalities at varying levels of detail for a comprehensive understanding of the multimodal knowledge elicited by various prompts.
arXiv Detail & Related papers (2024-06-06T08:21:30Z)
MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [63.78384552789171]
This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm. We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives. Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
arXiv Detail & Related papers (2023-12-11T13:11:04Z)
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [74.31268379055201]
mPLUG-Owl2 is a versatile multi-modal large language model. It effectively leverages modality collaboration to improve performance in both text and multi-modal tasks.
arXiv Detail & Related papers (2023-11-07T14:21:29Z)
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention [55.2825684201129]
DeepSpeed-VisualChat is designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities. Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions.
arXiv Detail & Related papers (2023-09-25T17:53:29Z)
Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously. We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.