Gemini Pro Defeated by GPT-4V: Evidence from Education
- URL: http://arxiv.org/abs/2401.08660v1
- Date: Wed, 27 Dec 2023 02:56:41 GMT
- Title: Gemini Pro Defeated by GPT-4V: Evidence from Education
- Authors: Gyeong-Geon Lee, Ehsan Latif, Lehong Shi, and Xiaoming Zhai
- Abstract summary: GPT-4V significantly outperforms Gemini Pro in terms of scoring accuracy and Quadratic Weighted Kappa.
Findings suggest GPT-4V's superior capability in handling complex educational tasks.
- Score: 1.0226894006814744
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This study compared the classification performance of Gemini Pro and GPT-4V
in educational settings. Employing visual question answering (VQA) techniques,
the study examined both models' abilities to read text-based rubrics and then
automatically score student-drawn models in science education. We employed both
quantitative and qualitative analyses using a dataset derived from
student-drawn scientific models and employing NERIF (Notation-Enhanced Rubrics
for Image Feedback) prompting methods. The findings reveal that GPT-4V
significantly outperforms Gemini Pro in terms of scoring accuracy and Quadratic
Weighted Kappa. The qualitative analysis reveals that the differences may be
due to the models' ability to process fine-grained texts in images and overall
image classification performance. Even adapting the NERIF approach by further
de-sizing the input images, Gemini Pro seems not able to perform as well as
GPT-4V. The findings suggest GPT-4V's superior capability in handling complex
multimodal educational tasks. The study concludes that while both models
represent advancements in AI, GPT-4V's higher performance makes it a more
suitable tool for educational applications involving multimodal data
interpretation.
Related papers
- Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.
Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
arXiv Detail & Related papers (2024-11-07T22:51:47Z) - Gemini vs GPT-4V: A Preliminary Comparison and Combination of
Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision)
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z) - GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? [82.40761196684524]
This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.
We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds.
Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
arXiv Detail & Related papers (2023-11-27T11:29:10Z) - NERIF: GPT-4V for Automatic Scoring of Drawn Models [0.6278186810520364]
Recently released GPT-4V provides a unique opportunity to advance scientific modeling practices.
We developed a method employing instructional note and rubrics to prompt GPT-4V to score students' drawn models.
GPT-4V scores were compared with human experts' scores to calculate scoring accuracy.
arXiv Detail & Related papers (2023-11-21T20:52:04Z) - GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z) - An Early Evaluation of GPT-4V(ision) [40.866323649060696]
We evaluate different abilities of GPT-4V including visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio.
To estimate GPT-4V's performance, we manually construct 656 test instances and carefully evaluate the results of GPT-4V.
arXiv Detail & Related papers (2023-10-25T10:33:17Z) - The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs.
GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system.
GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.