An Evaluation of GPT-4V and Gemini in Online VQA
- URL: http://arxiv.org/abs/2312.10637v2
- Date: Wed, 14 Feb 2024 03:49:50 GMT
- Title: An Evaluation of GPT-4V and Gemini in Online VQA
- Authors: Mengchen Liu, Chongyan Chen, Danna Gurari
- Abstract summary: We evaluate two state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering dataset.
We conduct a fine-grained analysis by generating seven types of metadata for nearly 2,000 visual questions.
Our zero-shot performance analysis highlights the types of questions that are most challenging for both models.
- Score: 31.77015255871848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While there is much excitement about the potential of large multimodal models
(LMM), a comprehensive evaluation is critical to establish their true
capabilities and limitations. In support of this aim, we evaluate two
state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering
dataset sourced from an authentic online question answering community. We
conduct fine-grained analysis by generating seven types of metadata for nearly
2,000 visual questions, such as image type and the required image processing
capabilities. Our zero-shot performance analysis highlights the types of
questions that are most challenging for both models, including questions
related to "puzzling" topic, with "Identification" user intention, with "Sheet
Music" image type, or labeled as "hard" by GPT-4.
Related papers
- Seeing the Forest and the Trees: Solving Visual Graph and Tree Based Data Structure Problems using Large Multimodal Models [2.1894663332872932]
We investigate the capabilities of large multimodal models (LMMs) to solve graph and tree data structure problems based only on images.
GPT-4o and Gemini 1.5 Flash performed best on trees and graphs respectively.
Our findings highlight the influence of structural and visual variations on model performance.
arXiv Detail & Related papers (2024-12-15T07:15:19Z) - AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
Multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities.
Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial.
We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z) - VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception.
Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation.
We introduce the VQA2 Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment.
The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z) - Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types [0.9217021281095907]
We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, for a comprehensive evaluation.
We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments.
arXiv Detail & Related papers (2024-09-14T02:29:36Z) - Q-Ground: Image Quality Grounding with Large Multi-modality Models [61.72022069880346]
We introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding.
Q-Ground combines large multi-modality models with detailed visual quality analysis.
Central to our contribution is the introduction of the QGround-100K dataset.
arXiv Detail & Related papers (2024-07-24T06:42:46Z) - Gemini vs GPT-4V: A Preliminary Comparison and Combination of
Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z) - GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection [51.43589678946244]
This paper explores the potential of VQA-oriented GPT-4V in the popular visual Anomaly Detection (AD) task.
It is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
arXiv Detail & Related papers (2023-11-05T10:01:18Z) - GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z) - Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic
Reasoning Task 2023 [13.326745559876558]
We present our solution to a Multi-modal Algorithmic Reasoning Task: SMART-101 Challenge.
This challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuolinguistic puzzles.
Under the puzzle splits configuration, we achieved an accuracy score of 26.5 on the validation set and 24.30 on the private test set.
arXiv Detail & Related papers (2023-10-10T09:12:27Z)