A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual
Question Answering
- URL: http://arxiv.org/abs/2311.07536v2
- Date: Sat, 27 Jan 2024 14:16:54 GMT
- Title: A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual
Question Answering
- Authors: Yunxin Li, Longyue Wang, Baotian Hu, Xinyu Chen, Wanqi Zhong, Chenyang
Lyu, Wei Wang, Min Zhang
- Abstract summary: The emergence of multimodal large models (MLMs) has significantly advanced the field of visual understanding.
Yet, the true challenge lies in the domain of knowledge-intensive visual question answering (VQA) tasks.
This study provides an in-depth evaluation of the newly introduced GPT-4V.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The emergence of multimodal large models (MLMs) has significantly advanced
the field of visual understanding, offering remarkable capabilities in the
realm of visual question answering (VQA). Yet, the true challenge lies in the
domain of knowledge-intensive VQA tasks, which necessitate not just recognition
of visual elements, but also a deep comprehension of the visual information in
conjunction with a vast repository of learned knowledge. To uncover such
capabilities of MLMs, particularly the newly introduced GPT-4V, we provide an
in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which
assesses how well models can understand visual cues and connect to general
knowledge; 2) Fine-grained World Knowledge, which tests the model's skill in
reasoning out specific knowledge from images, showcasing its proficiency
across various specialized fields; 3) Comprehensive Knowledge with
Decision-making Rationales, which examines the model's capability to provide
logical explanations for its inferences, facilitating deeper analysis from the
interpretability perspective. Extensive experiments indicate that GPT-4V
achieves state-of-the-art performance on the above three tasks. Interestingly,
we find that: a) GPT-4V demonstrates enhanced reasoning and explanation when
using composite images as few-shot examples; b) GPT-4V produces severe
hallucinations when dealing with world knowledge, highlighting the need for
future advances in this research direction.
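The abstract's three evaluation perspectives all reduce to comparing a model's free-form answers against reference answers. The paper does not publish code; the sketch below is a hypothetical, minimal scoring harness in which `normalize` and `score_answers` are illustrative names (not from the paper), and the model call itself is left out. Normalized exact-match accuracy of this kind is a common VQA metric.

```python
# Hypothetical sketch of a knowledge-intensive VQA scoring harness.
# The paper does not release code; names here are illustrative assumptions,
# and the multimodal model call is deliberately left abstract.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and drop a leading article,
    so that 'The Eiffel Tower.' matches 'Eiffel Tower'."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"^(the|a|an)\s+", "", text).strip()

def score_answers(predictions, references):
    """Exact-match accuracy after normalization."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

# Example: two of three predicted answers match after normalization.
preds = ["The Eiffel Tower.", "a kangaroo", "Mount Fuji"]
refs  = ["Eiffel Tower", "Kangaroo", "Everest"]
print(score_answers(preds, refs))  # 2 of 3 correct
```

A fuller harness would also need a rationale-quality judgment for perspective 3), which exact match cannot capture; the paper's interpretability analysis is qualitative on that point.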
Related papers
- Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models [22.545127591893028]
Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA).
This is mainly due to their limited capability to effectively integrate complex visual cues with textual information and potential object hallucinations.
We present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA.
(arXiv, 2024-04-06)
- Effectiveness Assessment of Recent Large Vision-Language Models [78.69439393646554]
This paper endeavors to evaluate the competency of popular large vision-language models (LVLMs) in specialized and general tasks.
We employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial.
We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks.
(arXiv, 2024-03-07)
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [68.46457611340097]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing discrepancies when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
(arXiv, 2024-02-26)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
(arXiv, 2023-11-02)
- KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models [39.554274096542244]
KGQuiz is a knowledge-intensive benchmark to investigate the knowledge generalization abilities of large language models.
We evaluate 10 open-source and black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive tasks and knowledge domains.
We envision KGQuiz as a testbed to analyze such nuanced variations in performance across domains and task formats.
(arXiv, 2023-10-15)
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs.
GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system.
GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
(arXiv, 2023-09-29)
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
(arXiv, 2021-01-15)
This list is automatically generated from the titles and abstracts of the papers in this site.