WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning
- URL: http://arxiv.org/abs/2405.03272v1
- Date: Mon, 6 May 2024 08:42:34 GMT
- Title: WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning
- Authors: Yuanhan Zhang, Kaichen Zhang, Bo Li, Fanyi Pu, Christopher Arif Setiadharma, Jingkang Yang, Ziwei Liu
- Abstract summary: We present WorldQA, a video dataset designed to push the boundaries of multimodal world models.
We identify five essential types of world knowledge for question formulation.
We introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain.
- Score: 49.72868038180909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal information, together with our knowledge, helps us understand the complex and dynamic world. Large language models (LLMs) and large multimodal models (LMMs), however, still struggle to emulate this capability. In this paper, we present WorldQA, a video understanding dataset designed to push the boundaries of multimodal world models with three appealing properties: (1) Multimodal Inputs: The dataset comprises 1007 question-answer pairs and 303 videos, necessitating the analysis of both auditory and visual data for successful interpretation. (2) World Knowledge: We identify five essential types of world knowledge for question formulation. This approach challenges models to extend their capabilities beyond mere perception. (3) Long-Chain Reasoning: Our dataset requires an average of 4.45 reasoning steps per question, notably surpassing other videoQA datasets. Furthermore, we introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain, thereby facilitating accurate responses to WorldQA queries. Extensive evaluations of 13 prominent LLMs and LMMs reveal that WorldRetriever, although the most effective model, achieves only 70% of human-level performance on multiple-choice questions. This finding highlights the necessity for further advancement in the reasoning and comprehension abilities of models. Our experiments also yield several key insights. For instance, while humans tend to perform better with more frames, current LMMs, including WorldRetriever, show diminished performance under similar conditions. We hope that WorldQA, our methodology, and these insights can contribute to the future development of multimodal world models.
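The abstract describes WorldRetriever as an agent that routes a question through modality experts and then has an LLM synthesize their outputs into a reasoning chain before answering a multiple-choice question. The paper's code is not reproduced here; the sketch below is only a minimal illustration of that pipeline shape, where `transcribe_audio`, `caption_video`, and `query_llm` are hypothetical placeholder callables to be replaced by real expert models and an LLM backend.

```python
# Illustrative sketch of a WorldRetriever-style pipeline (assumptions, not the authors' code):
# expert models produce auditory and visual evidence, and an LLM fuses that evidence
# into a step-by-step reasoning chain that ends in a multiple-choice answer.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WorldQAItem:
    question: str
    choices: List[str]      # multiple-choice options, e.g. ["A. ...", "B. ..."]
    video_path: str
    audio_path: str


def answer_with_reasoning_chain(
    item: WorldQAItem,
    transcribe_audio: Callable[[str], str],  # expert 1: speech recognition over the audio track
    caption_video: Callable[[str], str],     # expert 2: captioner over sampled video frames
    query_llm: Callable[[str], str],         # any chat-style LLM endpoint
) -> str:
    """Collect expert evidence, ask the LLM to reason over it, and return an option letter."""
    audio_evidence = transcribe_audio(item.audio_path)
    visual_evidence = caption_video(item.video_path)

    prompt = (
        "Answer a question about a video using the evidence and your world knowledge.\n"
        f"Audio transcript: {audio_evidence}\n"
        f"Visual description: {visual_evidence}\n"
        f"Question: {item.question}\n"
        "Options:\n" + "\n".join(item.choices) + "\n"
        "Reason step by step, then output only the letter of the best option."
    )
    return query_llm(prompt)
```

Accuracy on the benchmark would then simply be the fraction of items whose returned letter matches the annotated answer, which is how a result such as "70% of human-level performance" could be computed once human accuracy on the same questions is known.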
Related papers
- MM-Ego: Towards Building Egocentric Multimodal LLMs [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding.
We develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data.
We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate models' ability to recognize and memorize visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z)
- MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos [155.52885252910693]
We introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding.
MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception.
The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld.
arXiv Detail & Related papers (2024-06-12T16:54:54Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions spanning three distinct reasoning dimensions: physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset, and the experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
- Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds [37.22688246779871]
Large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world.
However, LLMs tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to "a blindfolded text-based game."
We propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation.
arXiv Detail & Related papers (2023-10-20T03:22:05Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)