An Evaluation of GPT-4V and Gemini in Online VQA
- URL: http://arxiv.org/abs/2312.10637v2
- Date: Wed, 14 Feb 2024 03:49:50 GMT
- Title: An Evaluation of GPT-4V and Gemini in Online VQA
- Authors: Mengchen Liu, Chongyan Chen, Danna Gurari
- Abstract summary: We evaluate two state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering dataset.
We conduct a fine-grained analysis by generating seven types of metadata for nearly 2,000 visual questions.
Our zero-shot performance analysis highlights the types of questions that are most challenging for both models.
- Score: 31.77015255871848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While there is much excitement about the potential of large multimodal models
(LMM), a comprehensive evaluation is critical to establish their true
capabilities and limitations. In support of this aim, we evaluate two
state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering
dataset sourced from an authentic online question answering community. We
conduct fine-grained analysis by generating seven types of metadata for nearly
2,000 visual questions, such as image type and the required image processing
capabilities. Our zero-shot performance analysis highlights the types of
questions that are most challenging for both models, including questions
related to "puzzling" topic, with "Identification" user intention, with "Sheet
Music" image type, or labeled as "hard" by GPT-4.
Related papers
- Seeing the Forest and the Trees: Solving Visual Graph and Tree Based Data Structure Problems using Large Multimodal Models [2.1894663332872932]
We investigate the capabilities of large multimodal models (LMMs) to solve graph and tree data structure problems based only on images.
GPT-4o and Gemini 1.5 Flash performed best on trees and graphs respectively.
Our findings highlight the influence of structural and visual variations on model performance.
arXiv Detail & Related papers (2024-12-15T07:15:19Z) - AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
Multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities.
Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial.
We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z) - VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception.
Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation.
We introduce the VQA2 Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment.
The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z) - Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types [0.9217021281095907]
We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, for a comprehensive evaluation.
We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments.
arXiv Detail & Related papers (2024-09-14T02:29:36Z) - Q-Ground: Image Quality Grounding with Large Multi-modality Models [61.72022069880346]
We introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding.
Q-Ground combines large multi-modality models with detailed visual quality analysis.
Central to our contribution is the introduction of the QGround-100K dataset.
arXiv Detail & Related papers (2024-07-24T06:42:46Z) - Gemini vs GPT-4V: A Preliminary Comparison and Combination of
Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z) - GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection [51.43589678946244]
This paper explores the potential of VQA-oriented GPT-4V in the popular visual Anomaly Detection (AD) task.
It is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
arXiv Detail & Related papers (2023-11-05T10:01:18Z) - GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z) - Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic
Reasoning Task 2023 [13.326745559876558]
We present our solution to a Multi-modal Algorithmic Reasoning Task: SMART-101 Challenge.
This challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuolinguistic puzzles.
Under the puzzle splits configuration, we achieved an accuracy score of 26.5 on the validation set and 24.30 on the private test set.
arXiv Detail & Related papers (2023-10-10T09:12:27Z)