An Evaluation of GPT-4V and Gemini in Online VQA
- URL: http://arxiv.org/abs/2312.10637v2
- Date: Wed, 14 Feb 2024 03:49:50 GMT
- Title: An Evaluation of GPT-4V and Gemini in Online VQA
- Authors: Mengchen Liu, Chongyan Chen, Danna Gurari
- Abstract summary: We evaluate two state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering dataset.
We conduct fine-grained analysis by generating seven types of metadata for nearly 2,000 visual questions.
Our zero-shot performance analysis highlights the types of questions that are most challenging for both models.
- Score: 31.77015255871848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While there is much excitement about the potential of large multimodal models
(LMM), a comprehensive evaluation is critical to establish their true
capabilities and limitations. In support of this aim, we evaluate two
state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering
dataset sourced from an authentic online question answering community. We
conduct fine-grained analysis by generating seven types of metadata for nearly
2,000 visual questions, such as image type and the required image processing
capabilities. Our zero-shot performance analysis highlights the types of
questions that are most challenging for both models, including questions
related to the "puzzling" topic, with the "Identification" user intention, with
the "Sheet Music" image type, or labeled as "hard" by GPT-4.
Related papers
- VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception.
Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation.
We introduce the VQA2 Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment.
The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z) - MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities [146.4724093405187]
We introduce MM-Vet v2, which includes a new capability called "image-text sequence understanding".
Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0.
arXiv Detail & Related papers (2024-08-01T17:59:54Z) - Q-Ground: Image Quality Grounding with Large Multi-modality Models [61.72022069880346]
We introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding.
Q-Ground combines large multi-modality models with detailed visual quality analysis.
Central to our contribution is the introduction of the QGround-100K dataset.
arXiv Detail & Related papers (2024-07-24T06:42:46Z) - VISREAS: Complex Visual Reasoning with Unanswerable Questions [29.398956873585796]
We introduce a new compositional visual question-answering dataset, VISREAS.
It consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations.
The unique feature of this task, validating question answerability with respect to an image before answering, and the poor performance of state-of-the-art models inspired the design of a new modular baseline, LOGIC2VISION.
arXiv Detail & Related papers (2024-02-23T00:12:10Z) - Gemini vs GPT-4V: A Preliminary Comparison and Combination of
Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z) - GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection [51.43589678946244]
This paper explores the potential of VQA-oriented GPT-4V in the popular visual Anomaly Detection (AD) task.
It is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
arXiv Detail & Related papers (2023-11-05T10:01:18Z) - GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z) - Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic
Reasoning Task 2023 [13.326745559876558]
We present our solution to a Multi-modal Algorithmic Reasoning Task: SMART-101 Challenge.
This challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuolinguistic puzzles.
Under the puzzle splits configuration, we achieved an accuracy score of 26.5 on the validation set and 24.30 on the private test set.
arXiv Detail & Related papers (2023-10-10T09:12:27Z) - Guiding Visual Question Generation [40.56637275354495]
In traditional Visual Question Generation (VQG), most images have multiple concepts for which a question could be generated.
We present Guiding Visual Question Generation - a variant of VQG which conditions the question generator on categorical information.
arXiv Detail & Related papers (2021-10-15T17:38:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.