OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
- URL: http://arxiv.org/abs/2510.15870v2
- Date: Mon, 27 Oct 2025 19:12:55 GMT
- Title: OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
- Authors: Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Yao Lu, Oluwatobi Olabiyi, Yu-Chiang Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov
- Abstract summary: We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings.
- Score: 146.029449832893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni by +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens, a 6x reduction from Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.
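The abstract names the two temporal mechanisms without giving equations. As a rough illustration only, the sketch below shows one plausible way to realize Temporal Embedding Grouping (relative alignment by time window) and a Constrained Rotary Time Embedding (absolute time) in PyTorch; the function names, tensor shapes, bucket count, and time cap are assumptions for illustration, not the authors' design.

```python
# Minimal sketch, assuming grouped time buckets and a capped rotary time code.
# All names and hyperparameters here are hypothetical, not the paper's code.
import torch


def temporal_embedding_grouping(vis, vis_t, aud, aud_t, num_groups=8):
    """Bucket vision and audio tokens into shared time windows so tokens that
    co-occur in time sit next to each other in the fused sequence."""
    t_max = torch.cat([vis_t, aud_t]).max() + 1e-6
    toks, times = [], []
    for g in range(num_groups):
        lo, hi = g * t_max / num_groups, (g + 1) * t_max / num_groups
        for emb, t in ((vis, vis_t), (aud, aud_t)):
            mask = (t >= lo) & (t < hi)
            toks.append(emb[mask])
            times.append(t[mask])
    # Time-interleaved tokens plus their timestamps, kept in matching order.
    return torch.cat(toks), torch.cat(times)


def constrained_rotary_time_embedding(x, t, t_cap=64.0):
    """Rotate embedding channel pairs by an angle proportional to the (capped)
    absolute timestamp, in the spirit of rotary position embeddings over time."""
    d = x.shape[-1]
    inv_freq = 1.0 / (10000.0 ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angle = torch.clamp(t, max=t_cap)[:, None] * inv_freq          # (N, d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * torch.cos(angle) - x2 * torch.sin(angle),
                      x1 * torch.sin(angle) + x2 * torch.cos(angle)], dim=-1)


# Example: 6 vision tokens and 4 audio tokens spanning 5 seconds of a clip.
vis, aud = torch.randn(6, 32), torch.randn(4, 32)
vis_t, aud_t = torch.linspace(0, 5, 6), torch.linspace(0, 5, 4)
tokens, times = temporal_embedding_grouping(vis, vis_t, aud, aud_t)
tokens = constrained_rotary_time_embedding(tokens, times)          # (10, 32)
```

In this reading, the grouping step interleaves vision and audio tokens window by window to expose their relative alignment, while the rotary step injects each token's absolute timestamp before the fused sequence reaches the LLM.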
Related papers
- OmniGAIA: Towards Native Omni-Modal AI Agents [103.79729735478924]
We introduce a benchmark designed to evaluate omni-modal agents on tasks requiring deep reasoning and multi-turn tool execution. We propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception.
arXiv Detail & Related papers (2026-02-26T11:35:04Z) - Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception [97.32606786622728]
We present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception.
arXiv Detail & Related papers (2025-10-14T17:00:09Z) - OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs [19.214764707089884]
We introduce OmniEval, a benchmark for evaluating omni-modality models. We design evaluation tasks that highlight the strong coupling between audio and video. We conduct experiments on OmniEval with several omni-modality models.
arXiv Detail & Related papers (2025-06-26T02:54:24Z) - OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts [46.77966058862399]
We introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. We propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
arXiv Detail & Related papers (2025-03-29T02:46:58Z) - Ola: Pushing the Frontiers of Omni-Modal Language Model [88.72389428177942]
We present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field.
arXiv Detail & Related papers (2025-02-06T18:59:55Z) - Baichuan-Omni-1.5 Technical Report [78.49101296394218]
Baichuan-Omni-1.5 is an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. We establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data. Second, an audio-tokenizer has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM.
arXiv Detail & Related papers (2025-01-26T02:19:03Z) - OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)