UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in OmniModels
- URL: http://arxiv.org/abs/2510.18915v2
- Date: Mon, 27 Oct 2025 03:35:48 GMT
- Title: UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in OmniModels
- Authors: Chen Chen, ZeYang Hu, Fengjiao Chen, Liya Ma, Jiaxing Liu, Xiaoyu Li, Xuezhi Cao,
- Abstract summary: Multimodal Large Languages models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities, collectively termed omni models.<n>We propose a novel, high quality and UNified Omni model benchmark, UNO-Bench, which effectively assesses both UNi-modal and Omni-modal capabilities.<n>The benchmark consists of curated human samples, with 98% cross-modality solvability, across 44 task types, and an innovative multi-step open-ended question type for assessing complex reasoning.
- Score: 12.233067923710635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Languages models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal remains unclear, which requires comprehensive evaluation to drive omni model's intelligence evolution. In this work, we propose a novel, high quality and UNified Omni model benchmark, UNO-Bench, which effectively assesses both UNi-modal and Omni-modal capabilities. The benchmark consists of 3730 human curated samples, with 98% cross-modality solvability, across 44 task types, and an innovative multi-step open-ended question type for assessing complex reasoning. Besides, a general scoring model supporting 6 question types is proposed for automated evaluation with 95% accuracy. Experimental result shows the Compositional Law between omni-modal and uni-modal performance and the omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models. The code and data are available at https://github.com/meituan-longcat/UNO-Bench
Related papers
- OmniRet: Efficient and High-Fidelity Omni Modality Retrieval [51.80205678389465]
We present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio.<n>Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others.
arXiv Detail & Related papers (2026-03-02T17:19:55Z) - NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms.<n>Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks.<n>To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z) - Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection [12.754751703604734]
This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task.<n>Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets.
arXiv Detail & Related papers (2025-08-22T06:29:29Z) - Is Extending Modality The Right Path Towards Omni-Modality? [34.79461922911039]
We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data.<n>We analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.
arXiv Detail & Related papers (2025-06-02T17:01:40Z) - Ola: Pushing the Frontiers of Omni-Modal Language Model [88.72389428177942]
We present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding.<n>Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements.<n>We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field.
arXiv Detail & Related papers (2025-02-06T18:59:55Z) - Baichuan-Omni-1.5 Technical Report [78.49101296394218]
Baichuan- Omni-1.5 is an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities.<n>We establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data.<n>Second, an audio-tokenizer has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM.
arXiv Detail & Related papers (2025-01-26T02:19:03Z) - OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities [124.05360767047539]
We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models.
evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges.
Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
arXiv Detail & Related papers (2024-10-16T04:29:46Z) - OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.<n>Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts.<n>We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z) - Explore the Limits of Omni-modal Pretraining at Scale [21.82148059125346]
We propose a scalable pretraining paradigm, named Multimodal Context (MiCo)
MiCo can scale up the numbers of modalities and amount of data, together with the model parameters, in the pretraining process.
Our models establish 37 new records for state-of-the-art performance.
arXiv Detail & Related papers (2024-06-13T17:59:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.