Ocean-omni: To Understand the World with Omni-modality
- URL: http://arxiv.org/abs/2410.08565v3
- Date: Tue, 5 Nov 2024 11:29:59 GMT
- Title: Ocean-omni: To Understand the World with Omni-modality
- Authors: Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, Zenan Zhou, Weipeng Chen,
- Abstract summary: Ocean-omni is the first open-source 7B Multimodal Large Language Model (MLLM)
We introduce Ocean-omni, the first open-source 7B Multimodal Large Language Model (MLLM)
- Score: 28.306965534325904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Ocean-omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with 7B model and proceeding through two stages of multimodal alignment and multitask fine-tuning across audio, image, video, and text modal. This approach equips the language model with the ability to handle visual and audio data effectively. Demonstrating strong performance across various omni-modal and multimodal benchmarks, we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.
Related papers
- Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond [51.141270065306514]
This tutorial aims to equip researchers, practitioners, and newcomers with the knowledge and skills to leverage multimodal AI.
We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language.
Hands-on laboratories will offer practical experience with state-of-the-art multimodal models.
arXiv Detail & Related papers (2024-10-08T01:41:56Z) - MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens.
MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z) - VITA: Towards Open-Source Interactive Omni Multimodal LLM [104.52782565106033]
We introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM)
We endow the language model with visual and audio capabilities through two-stage multi-task learning.
VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding.
arXiv Detail & Related papers (2024-08-09T17:59:49Z) - S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results.
The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector.
arXiv Detail & Related papers (2024-06-26T12:45:43Z) - POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models [28.072184039405784]
We present POEM, a visual analytics system to facilitate efficient prompt engineering for large language models (LLMs)
The system enables users to explore the interaction patterns across modalities at varying levels of detail for a comprehensive understanding of the multimodal knowledge elicited by various prompts.
arXiv Detail & Related papers (2024-06-06T08:21:30Z) - mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with
Modality Collaboration [74.31268379055201]
mPLUG-Owl2 is a versatile multi-modal large language model.
It effectively leverages modality collaboration to improve performance in both text and multi-modal tasks.
arXiv Detail & Related papers (2023-11-07T14:21:29Z) - DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via
Multi-Modal Causal Attention [55.2825684201129]
DeepSpeed-VisualChat is designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities.
Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions.
arXiv Detail & Related papers (2023-09-25T17:53:29Z) - Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal
Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.