A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual
Question Answering
- URL: http://arxiv.org/abs/2311.07536v2
- Date: Sat, 27 Jan 2024 14:16:54 GMT
- Title: A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual
Question Answering
- Authors: Yunxin Li, Longyue Wang, Baotian Hu, Xinyu Chen, Wanqi Zhong, Chenyang
Lyu, Wei Wang, Min Zhang
- Abstract summary: The emergence of multimodal large models (MLMs) has significantly advanced the field of visual understanding.
Yet, the true challenge lies in the domain of knowledge-intensive visual question answering (VQA) tasks.
This study provides an in-depth evaluation of the newly introduced GPT-4V.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The emergence of multimodal large models (MLMs) has significantly advanced
the field of visual understanding, offering remarkable capabilities in the
realm of visual question answering (VQA). Yet, the true challenge lies in the
domain of knowledge-intensive VQA tasks, which necessitate not just recognition
of visual elements, but also a deep comprehension of the visual information in
conjunction with a vast repository of learned knowledge. To uncover such
capabilities of MLMs, particularly the newly introduced GPT-4V, we provide an
in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which
assesses how well models can understand visual cues and connect to general
knowledge; 2) Fine-grained World Knowledge, which tests the model's skill in
reasoning out specific knowledge from images, showcasing its proficiency
across various specialized fields; 3) Comprehensive Knowledge with
Decision-making Rationales, which examines the model's capability to provide
logical explanations for its inferences, facilitating deeper analysis from the
interpretability perspective. Extensive experiments indicate that GPT-4V
achieves state-of-the-art performance on the above three tasks. Interestingly,
we find that: a) GPT-4V demonstrates enhanced reasoning and explanation when
using composite images as few-shot examples; b) GPT-4V produces severe
hallucinations when dealing with world knowledge, highlighting the need for
future advances in this research direction.
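The abstract's three evaluation perspectives all reduce to comparing a model's free-form answers against reference answers. The paper does not publish code; the sketch below is a hypothetical, minimal scoring harness in which `normalize` and `score_answers` are illustrative names (not from the paper), and the model call itself is left out. Normalized exact-match accuracy of this kind is a common VQA metric.

```python
# Hypothetical sketch of a knowledge-intensive VQA scoring harness.
# The paper does not release code; names here are illustrative assumptions,
# and the multimodal model call is deliberately left abstract.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and drop a leading article,
    so that 'The Eiffel Tower.' matches 'Eiffel Tower'."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"^(the|a|an)\s+", "", text).strip()

def score_answers(predictions, references):
    """Exact-match accuracy after normalization."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

# Example: two of three predicted answers match after normalization.
preds = ["The Eiffel Tower.", "a kangaroo", "Mount Fuji"]
refs  = ["Eiffel Tower", "Kangaroo", "Everest"]
print(score_answers(preds, refs))  # 2 of 3 correct
```

A fuller harness would also need a rationale-quality judgment for perspective 3), which exact match cannot capture; the paper's interpretability analysis is qualitative on that point.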
Related papers
- Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models [22.545127591893028]
Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA).
This is mainly due to their limited capability to effectively integrate complex visual cues with textual information and potential object hallucinations.
We present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA.
(arXiv, 2024-04-06)
- Effectiveness Assessment of Recent Large Vision-Language Models [78.69439393646554]
This paper endeavors to evaluate the competency of popular large vision-language models (LVLMs) in specialized and general tasks.
We employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial.
We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks.
(arXiv, 2024-03-07)
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [68.46457611340097]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing discrepancies when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
(arXiv, 2024-02-26)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
(arXiv, 2023-11-02)
- KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models [39.554274096542244]
KGQuiz is a knowledge-intensive benchmark to investigate the knowledge generalization abilities of large language models.
We evaluate 10 open-source and black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive tasks and knowledge domains.
We envision KGQuiz as a testbed to analyze such nuanced variations in performance across domains and task formats.
(arXiv, 2023-10-15)
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs.
GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system.
GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
(arXiv, 2023-09-29)
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
(arXiv, 2021-01-15)
This list is automatically generated from the titles and abstracts of the papers in this site.