An Evaluation of GPT-4V and Gemini in Online VQA
- URL: http://arxiv.org/abs/2312.10637v2
- Date: Wed, 14 Feb 2024 03:49:50 GMT
- Title: An Evaluation of GPT-4V and Gemini in Online VQA
- Authors: Mengchen Liu, Chongyan Chen, Danna Gurari
- Abstract summary: We evaluate two state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering dataset.
We conduct fine-grained analysis by generating seven types of metadata for nearly 2,000 visual questions.
Our zero-shot performance analysis highlights the types of questions that are most challenging for both models.
- Score: 31.77015255871848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While there is much excitement about the potential of large multimodal models
(LMM), a comprehensive evaluation is critical to establish their true
capabilities and limitations. In support of this aim, we evaluate two
state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering
dataset sourced from an authentic online question answering community. We
conduct fine-grained analysis by generating seven types of metadata for nearly
2,000 visual questions, such as image type and the required image processing
capabilities. Our zero-shot performance analysis highlights the types of
questions that are most challenging for both models, including questions
related to the "Puzzling" topic, with the "Identification" user intention, with the "Sheet
Music" image type, or labeled as "hard" by GPT-4.
Related papers
- Q-Ground: Image Quality Grounding with Large Multi-modality Models [61.72022069880346]
We introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding.
Q-Ground combines large multi-modality models with detailed visual quality analysis.
Central to our contribution is the introduction of the QGround-100K dataset.
arXiv Detail & Related papers (2024-07-24T06:42:46Z)
- Visual Haystacks: Answering Harder Questions About Sets of Images [63.296342841358815]
This paper explores the task of Multi-Image Visual Question Answering (MIQA)
Given a large set of images and a natural language query, the task is to generate a relevant and grounded response.
We introduce MIRAGE, a novel retrieval/QA framework tailored for Large Multimodal Models (LMMs)
arXiv Detail & Related papers (2024-07-18T17:59:30Z)
- VISREAS: Complex Visual Reasoning with Unanswerable Questions [29.398956873585796]
We introduce a new compositional visual question-answering dataset, VISREAS.
It consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations.
The unique feature of this task, validating question answerability with respect to an image before answering, and the poor performance of state-of-the-art models inspired the design of a new modular baseline, LOGIC2VISION.
arXiv Detail & Related papers (2024-02-23T00:12:10Z)
- CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence.
After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) actively without involving external tools.
Our trained model, CogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
arXiv Detail & Related papers (2024-02-06T18:43:48Z)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision)
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
- GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection [51.43589678946244]
This paper explores the potential of VQA-oriented GPT-4V in the popular visual Anomaly Detection (AD) task.
It is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
arXiv Detail & Related papers (2023-11-05T10:01:18Z)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z)
- Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic Reasoning Task 2023 [13.326745559876558]
We present our solution to a Multi-modal Algorithmic Reasoning Task: SMART-101 Challenge.
This challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuolinguistic puzzles.
Under the puzzle splits configuration, we achieved an accuracy score of 26.5 on the validation set and 24.30 on the private test set.
arXiv Detail & Related papers (2023-10-10T09:12:27Z)
- Guiding Visual Question Generation [40.56637275354495]
In traditional Visual Question Generation (VQG), most images have multiple concepts for which a question could be generated.
We present Guiding Visual Question Generation - a variant of VQG which conditions the question generator on categorical information.
arXiv Detail & Related papers (2021-10-15T17:38:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.