Ola: Pushing the Frontiers of Omni-Modal Language Model
- URL: http://arxiv.org/abs/2502.04328v3
- Date: Mon, 02 Jun 2025 19:33:24 GMT
- Title: Ola: Pushing the Frontiers of Omni-Modal Language Model
- Authors: Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
- Abstract summary: We present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field.
- Score: 88.72389428177942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, there is still a notable lag behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal Language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts, pushing the frontiers of the omni-modal language model to a large extent. We conduct a comprehensive exploration of architectural design, data curation, and training strategies essential for building a robust omni-modal model. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements over mainstream baselines. Moreover, we rethink inter-modal relationships during omni-modal training, emphasizing cross-modal alignment with video as a central bridge, and propose a progressive training pipeline that begins with the most distinct modalities and gradually moves towards closer modality alignment. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.
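The abstract's core training idea is a progressive curriculum: begin with the most distinct modalities, use video as a bridge, and only then move to full omni-modal alignment. Below is a minimal, hypothetical sketch of what such a staged schedule could look like; the stage names, step counts, and helper functions are illustrative assumptions, not Ola's actual implementation.

```python
"""Toy sketch of a progressive modality-alignment curriculum in the spirit of
Ola's pipeline (hypothetical names and data loaders, not the released code)."""

from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    modalities: list  # modalities mixed into this stage's batches
    steps: int

# Progressive curriculum: distinct modalities first, closer alignment later.
CURRICULUM = [
    Stage("text_image", ["text", "image"], steps=3),
    Stage("bridge_video", ["text", "image", "video"], steps=3),
    Stage("omni_audio", ["text", "image", "video", "audio"], steps=3),
]

def sample_mixed_batch(modalities):
    # Placeholder: a real loader would return aligned multimodal tensors.
    return {m: f"<{m}-batch>" for m in modalities}

def train_step(batch):
    # Placeholder for the forward/backward pass and optimizer update.
    print("training on:", sorted(batch))

for stage in CURRICULUM:
    print(f"--- stage: {stage.name} ---")
    for _ in range(stage.steps):
        train_step(sample_mixed_batch(stage.modalities))
```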
Related papers
- VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo [25.89459841661218]
VeOmni is a training framework for large language models (LLMs). VeOmni introduces model-centric distributed recipes that decouple communication from computation. Models trained with VeOmni reach over 2,800 tokens/sec/GPU throughput and scale to 160K context lengths via 3D parallelism on 128 GPUs.
arXiv Detail & Related papers (2025-08-04T11:33:04Z)
- Is Extending Modality The Right Path Towards Omni-Modality? [34.79461922911039]
We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. We analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.
arXiv Detail & Related papers (2025-06-02T17:01:40Z)
- Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
We introduce Nexus-O, an industry-level omni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data.
We address three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding and reasoning capabilities across multiple modalities?
Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios?
Third, what strategies can be employed to curate and obtain high-quality, real-life scenario
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
- OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis [68.73476738779628]
OpenOmni is a two-stage training framework that integrates omnimodal alignment and speech generation.
It surpasses state-of-the-art models across omnimodal, vision-language, and speech-language benchmarks.
OpenOmni achieves real-time speech generation with 1s latency in non-autoregressive mode.
arXiv Detail & Related papers (2025-01-08T15:18:09Z)
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition [57.131546757903834]
Lyra is an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.
arXiv Detail & Related papers (2024-12-12T17:50:39Z)
- Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities [0.0]
Mini-Omni2 is a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries.
We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset.
arXiv Detail & Related papers (2024-10-15T02:10:45Z)
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens. MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts.
To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
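The AnyGPT entry above rests on one idea: every modality is quantized into discrete tokens so a single autoregressive LM can process them uniformly. Here is a minimal sketch of that interleaving; the tokenizer sizes and vocabulary offsets are made-up placeholders, not AnyGPT's actual codecs.

```python
"""Hypothetical sketch of discrete-sequence multimodal modeling in the spirit
of AnyGPT: each modality gets a disjoint id range in a shared vocabulary, and
the interleaved sequence is modeled left-to-right by one LM."""

TEXT_VOCAB = 32_000    # assumed text vocabulary size
IMAGE_CODES = 8_192    # assumed image codebook size (e.g. a VQ tokenizer)
SPEECH_CODES = 1_024   # assumed speech codebook size

# Give each modality a disjoint id range in the shared token space.
IMAGE_OFFSET = TEXT_VOCAB
SPEECH_OFFSET = TEXT_VOCAB + IMAGE_CODES

def to_shared_vocab(modality: str, codes: list[int]) -> list[int]:
    """Shift modality-local codes into the shared vocabulary."""
    offset = {"text": 0, "image": IMAGE_OFFSET, "speech": SPEECH_OFFSET}[modality]
    return [c + offset for c in codes]

# Interleave modalities into one flat token sequence.
sequence = (
    to_shared_vocab("text", [101, 7, 42])     # e.g. "describe this image"
    + to_shared_vocab("image", [5, 900, 31])  # codes from an image quantizer
    + to_shared_vocab("text", [88, 12])       # the textual answer
)
print(sequence)
```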
- What Makes for Robust Multi-Modal Models in the Face of Missing Modalities? [35.19295402483624]
We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA).
UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction, and uses missing-modality data augmentation to better adapt to situations where modalities are absent.
arXiv Detail & Related papers (2023-10-10T07:47:57Z)
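A hedged sketch of the missing-modality augmentation idea from the UME-MMA entry above: during training, randomly blank out one modality so the fusion model learns to tolerate its absence. The drop probability, feature shapes, and zero-placeholder choice are illustrative assumptions, not the paper's actual recipe.

```python
"""Toy sketch of missing-modality data augmentation (in the spirit of
UME-MMA's adaptation step, not its actual code)."""

import random

def augment_missing_modality(batch: dict, p_drop: float = 0.3) -> dict:
    """With probability p_drop, zero out one randomly chosen modality."""
    out = dict(batch)
    if random.random() < p_drop:
        victim = random.choice(list(out))
        # Replace features with zeros of the same length (placeholder embedding).
        out[victim] = [0.0] * len(out[victim])
    return out

batch = {"audio": [0.2, 0.7, 0.1], "visual": [0.9, 0.4, 0.3]}
for _ in range(3):
    print(augment_missing_modality(batch))
```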
- X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics [99.03895740754402]
X-modaler encapsulates state-of-the-art cross-modal analytics techniques into several general-purpose stages.
X-modaler is Apache-licensed, and its source code, sample projects, and pre-trained models are available online.
arXiv Detail & Related papers (2021-08-18T16:05:30Z)
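To make the "general-purpose stages" idea from the X-modaler entry concrete, here is a toy staged pipeline where each stage is a swappable callable over shared state. This interface is a hypothetical illustration, not X-modaler's real API.

```python
"""Toy illustration of a stage-based cross-modal pipeline (hypothetical; not
X-modaler's actual API). Stages can be swapped per task, e.g. captioning,
retrieval, or VQA, while the surrounding loop stays fixed."""

from typing import Callable

Stage = Callable[[dict], dict]

def preprocess(state: dict) -> dict:
    state["tokens"] = state["caption"].lower().split()
    return state

def encode(state: dict) -> dict:
    # Placeholder encoder: map tokens to toy integer features.
    state["features"] = [hash(t) % 100 for t in state["tokens"]]
    return state

def decode(state: dict) -> dict:
    state["output"] = f"{len(state['features'])} encoded units"
    return state

PIPELINE: list[Stage] = [preprocess, encode, decode]

state = {"caption": "A dog rides a skateboard"}
for stage in PIPELINE:
    state = stage(state)
print(state["output"])
```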