An Early Evaluation of GPT-4V(ision)
- URL: http://arxiv.org/abs/2310.16534v1
- Date: Wed, 25 Oct 2023 10:33:17 GMT
- Title: An Early Evaluation of GPT-4V(ision)
- Authors: Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, Bing Qin
- Score: 40.866323649060696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we evaluate different abilities of GPT-4V including visual
understanding, language understanding, visual puzzle solving, and understanding
of other modalities such as depth, thermal, video, and audio. To estimate
GPT-4V's performance, we manually construct 656 test instances and carefully
evaluate the results of GPT-4V. The highlights of our findings are as follows:
(1) GPT-4V exhibits impressive performance on English visual-centric benchmarks
but fails to recognize simple Chinese text in images; (2) GPT-4V shows
inconsistent refusal behavior when answering questions related to sensitive
traits such as gender, race, and age; (3) GPT-4V obtains worse results than
GPT-4 (API) on language understanding tasks including general language
understanding benchmarks and visual commonsense knowledge evaluation
benchmarks; (4) Few-shot prompting can improve GPT-4V's performance on both
visual understanding and language understanding; (5) GPT-4V struggles to spot
the nuances between two similar images and to solve easy math picture puzzles;
(6) GPT-4V shows non-trivial performance on modalities similar to images, such
as video and thermal. Our experimental results reveal the abilities and
limitations of GPT-4V, and we hope our paper provides some insight into the
application and research of GPT-4V.
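Finding (4) above amounts to a concrete prompting pattern: place one or more worked exemplars (image, question, gold answer) ahead of the test instance. Below is a minimal sketch of that pattern against the OpenAI Chat Completions API; the model name, image URLs, and question are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of few-shot (1-shot) prompting for GPT-4V via the
# OpenAI Chat Completions API. All URLs, the question, and the model
# name are hypothetical placeholders, not from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def image_turn(text: str, image_url: str) -> dict:
    """Build a user message that pairs a text prompt with an image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

messages = [
    # One worked exemplar (image + question + gold answer) placed
    # before the test instance -- the demonstration that finding (4)
    # says improves performance.
    image_turn("Question: What color is the traffic light? Answer:",
               "https://example.com/exemplar.jpg"),   # hypothetical URL
    {"role": "assistant", "content": "Red."},          # gold answer
    # The actual test instance, formatted the same way.
    image_turn("Question: What color is the traffic light? Answer:",
               "https://example.com/test.jpg"),        # hypothetical URL
]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed GPT-4V-era model name
    messages=messages,
    max_tokens=64,
)
print(response.choices[0].message.content)
```

The only design choice here is that the exemplar uses exactly the same message format as the test instance, so the model sees the expected answer style before it answers.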
Related papers
- GPT-4o System Card [211.87336862081963]
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video.
It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network.
It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages.
arXiv Detail & Related papers (2024-10-25T17:43:01Z)
- Notes on Applicability of GPT-4 to Document Understanding [0.0]
We evaluate all publicly available GPT-4 family models on Document Understanding.
Benchmark results indicate that although it is hard to achieve satisfactory results with text-only models, GPT-4 Vision Turbo performs well when both text recognized by an external OCR engine and the document images are provided as input (see the sketch after this list).
arXiv Detail & Related papers (2024-05-28T17:59:53Z)
- Gemini Pro Defeated by GPT-4V: Evidence from Education [1.0226894006814744]
GPT-4V significantly outperforms Gemini Pro in terms of scoring accuracy and Quadratic Weighted Kappa.
Findings suggest GPT-4V's superior capability in handling complex educational tasks.
arXiv Detail & Related papers (2023-12-27T02:56:41Z)
- GPT-4V with Emotion: A Zero-shot Benchmark for Generalized Emotion Recognition [38.2581985358104]
GPT-4 with Vision (GPT-4V) has demonstrated remarkable visual capabilities across various tasks, but its performance in emotion recognition has not been fully evaluated.
We present the quantitative evaluation results of GPT-4V on 21 benchmark datasets covering 6 tasks.
arXiv Detail & Related papers (2023-12-07T13:27:37Z)
- GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? [82.40761196684524]
This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.
We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds.
Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
arXiv Detail & Related papers (2023-11-27T11:29:10Z)
- Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks [53.936643052339]
We evaluate the reasoning abilities of text-only and multimodal versions of GPT-4.
Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
arXiv Detail & Related papers (2023-11-14T04:33:49Z)
- Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges [54.42256219010956]
This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference.
Bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data; interference pertains to scenarios where GPT-4V(ision)'s judgment can be disrupted by how the text prompt is phrased or how the input image is presented.
arXiv Detail & Related papers (2023-11-06T17:26:59Z)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z)
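The Document Understanding entry above reports that GPT-4 Vision Turbo does best when given both OCR output and the page image. Here is a minimal sketch of that input strategy, assuming the OpenAI Chat Completions API and pytesseract as the external OCR engine (both are our choices for illustration, not the paper's):

```python
# Combine externally recognized OCR text with the document image in a
# single vision request. The file name, question, and model name are
# hypothetical; any OCR engine's output would serve the same role.
import base64

import pytesseract
from PIL import Image
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

page = "page.png"  # hypothetical document image
ocr_text = pytesseract.image_to_string(Image.open(page))

# Encode the page image as a base64 data URL for the API.
with open(page, "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-turbo",  # assumed Vision Turbo-class model name
    messages=[{
        "role": "user",
        "content": [
            # Both the OCR transcript and the rendered page go into
            # one user turn, mirroring the dual-input setup described.
            {"type": "text",
             "text": "OCR transcript of the page:\n"
                     f"{ocr_text}\n\n"
                     "Using both the transcript and the image, "
                     "what is the invoice total?"},  # hypothetical question
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```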