GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
- URL: http://arxiv.org/abs/2311.15732v2
- Date: Tue, 12 Mar 2024 01:07:14 GMT
- Title: GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
- Authors: Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang,
Jingdong Wang
- Abstract summary: This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.
We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds.
Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
- Score: 82.40761196684524
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper does not present a novel method. Instead, it delves into an
essential, yet must-know baseline in light of the latest advancements in
Generative Artificial Intelligence (GenAI): the utilization of GPT-4 for visual
understanding. Our study centers on the evaluation of GPT-4's linguistic and
visual capabilities in zero-shot visual recognition tasks: Firstly, we explore
the potential of its generated rich textual descriptions across various
categories to enhance recognition performance without any training. Secondly,
we evaluate GPT-4's visual proficiency in directly recognizing diverse visual
content. We conducted extensive experiments to systematically evaluate GPT-4's
performance across images, videos, and point clouds, using 16 benchmark
datasets to measure top-1 and top-5 accuracy. Our findings show that GPT-4,
enhanced with rich linguistic descriptions, significantly improves zero-shot
recognition, offering an average top-1 accuracy increase of 7% across all
datasets. GPT-4 excels in visual recognition, outshining OpenAI-CLIP's ViT-L
and rivaling EVA-CLIP's ViT-E, particularly in video datasets HMDB-51 and
UCF-101, where it leads by 22% and 9%, respectively. We hope this research
contributes valuable data points and experience for future studies. We release
our code at https://github.com/whwu95/GPT4Vis.
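
The abstract above describes two evaluation routes; the first one (GPT-4-generated class descriptions used as additional text prompts for a CLIP-style zero-shot classifier) can be sketched in a few lines. The snippet below is a minimal illustration assuming OpenAI's `clip` package and the `openai>=1.0` Python client; the prompt wording, the number of descriptions, and the simple embedding average are placeholder choices for illustration, not the released GPT4Vis pipeline.

```python
# Sketch: description-augmented zero-shot classification with CLIP + GPT-4.
# Assumes: pip install git+https://github.com/openai/CLIP.git openai pillow torch
# and OPENAI_API_KEY set in the environment. Illustrative only.
import torch
import clip
from PIL import Image
from openai import OpenAI

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt4_descriptions(category: str, n: int = 5) -> list[str]:
    """Ask GPT-4 for n short visual descriptions of a category (placeholder prompt)."""
    resp = llm.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Give {n} short sentences describing the visual "
                       f"appearance of a {category}, one per line.",
        }],
    )
    return [s.strip() for s in resp.choices[0].message.content.splitlines() if s.strip()]

@torch.no_grad()
def class_embedding(category: str) -> torch.Tensor:
    """Encode the plain class prompt plus GPT-4 descriptions and average them."""
    texts = [f"a photo of a {category}"] + gpt4_descriptions(category)
    tokens = clip.tokenize(texts, truncate=True).to(device)
    feats = model.encode_text(tokens)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)

@torch.no_grad()
def classify(image_path: str, categories: list[str]) -> str:
    """Pick the category whose averaged text embedding best matches the image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    text_feats = torch.stack([class_embedding(c) for c in categories])
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    sims = (img_feat @ text_feats.T).squeeze(0)
    return categories[sims.argmax().item()]

print(classify("example.jpg", ["golden retriever", "tabby cat", "parrot"]))
```

In practice the generated descriptions would be cached per category rather than re-queried on every call, and the averaging step could be replaced by whatever prompt-ensembling scheme a given paper adopts.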
Related papers
- Exploiting GPT-4 Vision for Zero-shot Point Cloud Understanding [114.4754255143887]
We tackle the challenge of classifying the object category in point clouds.
We employ GPT-4 Vision (GPT-4V) to overcome these challenges.
We set a new benchmark in zero-shot point cloud classification.
arXiv Detail & Related papers (2024-01-15T10:16:44Z)
- GPT-4V with Emotion: A Zero-shot Benchmark for Generalized Emotion Recognition [38.2581985358104]
GPT-4 with Vision (GPT-4V) has demonstrated remarkable visual capabilities across various tasks, but its performance in emotion recognition has not been fully evaluated.
We present the quantitative evaluation results of GPT-4V on 21 benchmark datasets covering 6 tasks.
arXiv Detail & Related papers (2023-12-07T13:27:37Z)
- GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection [51.43589678946244]
This paper explores the potential of VQA-oriented GPT-4V in the popular visual Anomaly Detection (AD) task.
It is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
arXiv Detail & Related papers (2023-11-05T10:01:18Z)
- An Early Evaluation of GPT-4V(ision) [40.866323649060696]
We evaluate different abilities of GPT-4V including visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio.
To estimate GPT-4V's performance, we manually construct 656 test instances and carefully evaluate the results of GPT-4V.
arXiv Detail & Related papers (2023-10-25T10:33:17Z)
- Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts [13.486599520658919]
GPT-4 can be used to generate text that is visually descriptive.
We show considerable improvements in 0-shot transfer accuracy on specialized fine-grained datasets.
arXiv Detail & Related papers (2023-07-21T15:49:59Z)
- Visual Instruction Tuning [79.70923292053097]
We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
arXiv Detail & Related papers (2023-04-17T17:59:25Z)
- Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data.
We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more.
We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z)
- GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)