Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA
- URL: http://arxiv.org/abs/2401.15847v3
- Date: Thu, 27 Jun 2024 15:38:17 GMT
- Title: Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA
- Authors: Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Xinze Guan, Xin Eric Wang
- Abstract summary: We introduce Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets of questions, answers, and multipanel images.
Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Multimodal Large Language Models (MLLMs) tested.
- Score: 27.814920184313962
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of multipanel visual reasoning is essential, and a comprehensive evaluation of models in this regard is important. Therefore, we introduce Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets of questions, answers, and multipanel images that specifically challenge models in comprehending multipanel images. Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Multimodal Large Language Models (MLLMs) tested, even though humans can attain approximately 99% accuracy on these questions. Distinctively, the MultipanelVQA benchmark features synthetically generated multipanel images specifically crafted to isolate and assess the impact of various factors, such as the layout, on MLLMs' multipanel image comprehension abilities. As a result, in addition to benchmarking the capabilities of MLLMs in understanding multipanel images, we analyze various factors of the multipanel image that affect MLLMs' performance with synthetic data and offer insights for enhancement. Code and data are released at https://sites.google.com/view/multipanelvqa/home.
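The abstract describes scoring models on question-answer-image triplets and isolating factors such as layout. A minimal sketch of how per-factor accuracy could be tallied is shown here. The `Triplet` fields, the `layout` labels, and the exact-match scoring rule are all assumptions made for illustration; they are not taken from the released code or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One benchmark item. Field names are hypothetical, not the released schema."""
    image_path: str
    question: str
    answer: str
    layout: str  # hypothetical label for the factor being isolated, e.g. "grid"


def normalize(text):
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())


def accuracy_by_layout(triplets, predictions):
    """Return exact-match accuracy per layout (an assumed scoring rule)."""
    if len(triplets) != len(predictions):
        raise ValueError("triplets and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(triplets, predictions):
        total[item.layout] += 1
        if normalize(pred) == normalize(item.answer):
            correct[item.layout] += 1
    return {layout: correct[layout] / total[layout] for layout in total}


# Toy data for illustration only; not drawn from the benchmark.
items = [
    Triplet("a.png", "Is there a muffin?", "yes", "grid"),
    Triplet("b.png", "Is there a chihuahua?", "no", "grid"),
    Triplet("c.png", "How many panels show a dog?", "3", "row"),
]
model_outputs = ["Yes.", "yes", " 3 "]
print(accuracy_by_layout(items, model_outputs))
```

Grouping by a single factor keeps the comparison between layouts separate from overall accuracy, which is the kind of breakdown the synthetic subset is described as enabling.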
Related papers
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images [70.44423964171088]
We propose a new benchmark MIBench to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios.
MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples.
The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs.
arXiv Detail & Related papers (2024-07-21T21:22:58Z)
- Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning [15.296263261737026]
We introduce a Multi-Image MIRB Benchmark to evaluate visual language models' ability to compare, analyze, and reason across multiple images.
Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning.
We demonstrate that while open-source VLMs were shown to approach GPT-4V in single-image tasks, a significant gap remains in multi-image reasoning tasks.
arXiv Detail & Related papers (2024-06-18T16:02:18Z)
- MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding [150.28164854480912]
We introduce MuirBench, a benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs.
MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations.
We show that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% accuracy, respectively.
arXiv Detail & Related papers (2024-06-13T17:59:52Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending context involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models [97.40590590880144]
We develop an extensive Multimodality Large Language Model (MLLM) series.
We assemble a comprehensive dataset covering publicly available resources in language, vision, and vision-language tasks.
We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
arXiv Detail & Related papers (2024-02-08T18:59:48Z)
- SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Based on our proposed joint mixing, we propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images.
We hope our work may cast a light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.