WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
- URL: http://arxiv.org/abs/2502.04326v1
- Date: Thu, 06 Feb 2025 18:59:40 GMT
- Title: WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
- Authors: Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie
- Abstract summary: We introduce WorldSense, the first benchmark to assess multi-modal video understanding. The evaluation tasks are designed to feature a strong coupling of audio and video. WorldSense encompasses a diverse collection of 1,662 audio-visual synchronised videos.
- Score: 44.28540993567552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce WorldSense, the first benchmark to assess multi-modal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, our WorldSense has several features: (i) collaboration of omni-modality: we design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively utilize the synergistic perception of omni-modality; (ii) diversity of videos and tasks: WorldSense encompasses a diverse collection of 1,662 audio-visual synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover broad scenarios, and 3,172 multi-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation; (iii) high-quality annotations: all QA pairs are manually labeled by 80 expert annotators with multiple rounds of correction to ensure quality. Based on our WorldSense, we extensively evaluate various state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (48.0% best accuracy). We hope our WorldSense can provide a platform for evaluating the ability to construct and understand coherent contexts from omni-modality.
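Since WorldSense consists of multiple-choice QA pairs organised by domain, subcategory, and task, evaluation reduces to accuracy over predicted option letters. The sketch below is a minimal, hypothetical harness: the JSON schema, field names, and file layout are assumptions for illustration and not the released WorldSense format. The reported 48.0% best accuracy would correspond to the `overall` value such a script produces.

```python
import json
from collections import defaultdict

# Hypothetical layout of one WorldSense-style QA item (field names and
# values are assumptions for illustration, not the released schema):
# {
#   "id": "...",
#   "video_id": "...",    # audio-visual synchronised clip
#   "domain": "...",      # one of the 8 primary domains
#   "subcategory": "...", # one of the 67 fine-grained subcategories
#   "task": "...",        # one of the 26 distinct tasks
#   "question": "...",
#   "options": ["A. ...", "B. ...", "C. ...", "D. ..."],
#   "answer": "B"
# }

def evaluate(pred_file: str, qa_file: str) -> dict:
    """Compute overall and per-domain multiple-choice accuracy."""
    with open(qa_file) as f:
        qa_items = {item["id"]: item for item in json.load(f)}
    with open(pred_file) as f:
        preds = json.load(f)  # e.g. {"<id>": "B", ...}

    correct, total = defaultdict(int), defaultdict(int)
    for qid, item in qa_items.items():
        total[item["domain"]] += 1
        if preds.get(qid) == item["answer"]:
            correct[item["domain"]] += 1

    per_domain = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return {"overall": overall, "per_domain": per_domain}
```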
Related papers
- Empowering Agentic Video Analytics Systems with Video Language Models [16.962065998310223]
Video-Language Models (VLMs) offer potential for enabling open-ended video understanding, reasoning, and analytics.
VLMs' limited context windows present challenges when processing ultra-long video content.
We introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics.
arXiv Detail & Related papers (2025-05-01T02:40:23Z)
- OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts [46.77966058862399]
We introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts.
We propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
arXiv Detail & Related papers (2025-03-29T02:46:58Z)
- MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX [15.038202110401336]
MAVERIX (Multimodal Audio-Visual Evaluation Reasoning IndeX) is a novel benchmark with 700 videos and 2,556 questions.
It is designed to evaluate multimodal models through tasks that necessitate close integration of video and audio information.
Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show performance approaching human levels.
arXiv Detail & Related papers (2025-03-27T17:04:33Z)
- MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion [44.45109614673675]
We create a search system that extracts text and features from both visual and audio modalities.
MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs.
arXiv Detail & Related papers (2025-03-26T16:28:04Z)
- OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities [124.05360767047539]
We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models.
Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges.
Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
arXiv Detail & Related papers (2024-10-16T04:29:46Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding [57.630136434038384]
We introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark), a large-scale benchmark for open-ended event-level video understanding.
We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks.
Our simple but effective solution demonstrates superior performance in multiple scenarios.
arXiv Detail & Related papers (2024-09-26T17:53:04Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts.
To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models [34.843603169616486]
We introduce AlignMMBench, a comprehensive alignment benchmark for emerging Chinese Vision-Language Models (VLMs).
This benchmark is meticulously curated from real-world scenarios and Chinese Internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios.
To facilitate the evaluation pipeline, we propose CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4's evaluation ability.
arXiv Detail & Related papers (2024-06-13T16:30:14Z)
- WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning [49.72868038180909]
We present WorldQA, a video dataset designed to push the boundaries of multimodal world models.
We identify five essential types of world knowledge for question formulation.
We introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain.
arXiv Detail & Related papers (2024-05-06T08:42:34Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)