WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning
- URL: http://arxiv.org/abs/2405.03272v1
- Date: Mon, 6 May 2024 08:42:34 GMT
- Title: WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning
- Authors: Yuanhan Zhang, Kaichen Zhang, Bo Li, Fanyi Pu, Christopher Arif Setiadharma, Jingkang Yang, Ziwei Liu
- Abstract summary: We present WorldQA, a video dataset designed to push the boundaries of multimodal world models.
We identify five essential types of world knowledge for question formulation.
We introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain.
- Score: 49.72868038180909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal information, together with our knowledge, helps us understand the complex and dynamic world. Large language models (LLMs) and large multimodal models (LMMs), however, still struggle to emulate this capability. In this paper, we present WorldQA, a video understanding dataset designed to push the boundaries of multimodal world models with three appealing properties: (1) Multimodal Inputs: The dataset comprises 1007 question-answer pairs and 303 videos, necessitating the analysis of both auditory and visual data for successful interpretation. (2) World Knowledge: We identify five essential types of world knowledge for question formulation. This approach challenges models to extend their capabilities beyond mere perception. (3) Long-Chain Reasoning: Our dataset requires an average of 4.45 reasoning steps per question, notably surpassing other videoQA datasets. Furthermore, we introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain, thereby facilitating accurate responses to WorldQA queries. Extensive evaluations of 13 prominent LLMs and LMMs reveal that WorldRetriever, despite being the most effective model, achieved only 70% of human-level performance on multiple-choice questions. This finding highlights the need for further advances in the reasoning and comprehension abilities of models. Our experiments also yield several key insights. For instance, while humans tend to perform better with more frames, current LMMs, including WorldRetriever, show diminished performance under similar conditions. We hope that WorldQA, our methodology, and these insights can contribute to the future development of multimodal world models.
Related papers
- From Efficient Multimodal Models to World Models: A Survey [28.780451336834876]
Multimodal Large Models (MLMs) are becoming a significant research focus combining powerful language models with multimodal learning.
This review explores the latest developments and challenges in large instructions, emphasizing their potential in achieving artificial general intelligence.
arXiv Detail & Related papers (2024-06-27T15:36:43Z)
- MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos [155.52885252910693]
We introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding.
MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception.
The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld.
arXiv Detail & Related papers (2024-06-12T16:54:54Z)
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
- Multimodal Large Language Models to Support Real-World Fact-Checking [80.41047725487645]
Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information.
While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied.
We propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking.
arXiv Detail & Related papers (2024-03-06T11:32:41Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark our dataset against several state-of-the-art language-only and multimodal models and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
- Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds [37.22688246779871]
Large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world.
LLMs tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to "a blindfolded text-based game".
We propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation.
arXiv Detail & Related papers (2023-10-20T03:22:05Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present LAMM, a Language-Assisted Multi-Modal instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.