Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam
- URL: http://arxiv.org/abs/2406.09671v1
- Date: Fri, 14 Jun 2024 02:42:30 GMT
- Title: Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam
- Authors: Nabor C. Mendonça
- Abstract summary: This study investigates the performance of ChatGPT-4 Vision, OpenAI's most advanced visual model at the time of the study, on the Computer Science section of Brazil's 2021 National Undergraduate Exam (ENADE).
By presenting the model with the exam's open and multiple-choice questions in their original image format, we were able to evaluate the model's reasoning and self-reflecting capabilities.
ChatGPT-4 Vision significantly outperformed the average exam participant, positioning itself within the top 10 best score percentile.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent integration of visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education, where visual elements such as diagrams, charts, and tables are commonly used to improve the learning experience. This study investigates the performance of ChatGPT-4 Vision, OpenAI's most advanced visual model at the time the study was conducted, on the Bachelor in Computer Science section of Brazil's 2021 National Undergraduate Exam (ENADE). By presenting the model with the exam's open and multiple-choice questions in their original image format and allowing for reassessment in response to differing answer keys, we were able to evaluate the model's reasoning and self-reflecting capabilities in a large-scale academic assessment involving textual and visual content. ChatGPT-4 Vision significantly outperformed the average exam participant, positioning itself within the top 10 best score percentile. While it excelled in questions that incorporated visual elements, it also encountered challenges with question interpretation, logical reasoning, and visual acuity. The involvement of an independent expert panel to review cases of disagreement between the model and the answer key revealed some poorly constructed questions containing vague or ambiguous statements, calling attention to the critical need for improved question design in future exams. Our findings suggest that while ChatGPT-4 Vision shows promise in multimodal academic evaluations, human oversight remains crucial for verifying the model's accuracy and ensuring the fairness of high-stakes educational exams. The paper's research materials are publicly available at https://github.com/nabormendonca/gpt-4v-enade-cs-2021.
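To make the evaluation protocol concrete, the sketch below shows how an exam question supplied as an image might be sent to a GPT-4 Vision-class model through the OpenAI Python SDK. The model name, prompt wording, helper function, and file name are illustrative assumptions rather than the paper's actual pipeline, which is available in the linked repository.

```python
# Minimal sketch: querying a vision-capable OpenAI model with an exam
# question supplied as an image. Model name, prompt text, and file paths
# are assumptions for illustration, not the paper's actual code.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    """Return the image file as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def answer_exam_question(image_path: str) -> str:
    """Ask the model to reason about the question shown in the image."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model; the study used GPT-4 Vision
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "The image shows one question from a Computer Science "
                             "exam. Reason step by step, then state your final answer."},
                    {"type": "image_url",
                     "image_url": {"url": encode_image(image_path)}},
                ],
            }
        ],
    )
    return response.choices[0].message.content

print(answer_exam_question("enade_2021_q12.png"))  # hypothetical file name
```

The reassessment step described in the abstract could be approximated by appending the official answer key to the same conversation and asking the model whether it wishes to revise its answer.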
Related papers
- MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models [115.16022378880376]
We introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench.
MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions.
Results show that all large vision-language models (LVLMs) exhibit greater improvements when augmented with images than with textual knowledge.
arXiv Detail & Related papers (2024-10-10T17:55:02Z) - Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency [3.161954199291541]
This research study comprehensively evaluates the language, vision, speech, and multimodal capabilities of GPT-4o.
GPT-4o demonstrates high accuracy and efficiency across multiple domains in language and reasoning capabilities.
The model shows variability and faces limitations in handling complex and ambiguous inputs.
arXiv Detail & Related papers (2024-06-19T19:00:21Z) - Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI [0.6278186810520364]
Most qualitative analysis and explanation of image data has been conducted by human researchers, without machine-based automation.
The recent development of Visual Question Answering (VQA) techniques is making visual language models practically usable.
This paper introduces VQA for educational studies, providing a milestone for educational research methodology.
arXiv Detail & Related papers (2024-05-12T05:05:31Z) - Assessing the Aesthetic Evaluation Capabilities of GPT-4 with Vision: Insights from Group and Individual Assessments [2.539875353011627]
This study investigates the performance of GPT-4 with Vision on the task of aesthetic evaluation of images.
We employ two tasks: predicting a group's average evaluation values and predicting an individual's evaluation values.
Experimental results reveal GPT-4 with Vision's strong performance in predicting aesthetic evaluations and its differing responses to beauty and ugliness.
arXiv Detail & Related papers (2024-03-06T10:27:09Z) - ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z) - Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z) - Evaluating GPT-4's Vision Capabilities on Brazilian University Admission Exams [14.801853435122908]
We present a framework to evaluate language models on entrance exams, which incorporates both textual and visual elements.
We evaluate the two most recent editions of Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities.
One of the highlights is that text captions transcribing visual content outperform the direct use of images, suggesting that the vision model has room for improvement (see the sketch after this list).
arXiv Detail & Related papers (2023-11-23T19:20:59Z) - Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [81.20804369985376]
We conduct a large-scale subjective experiment collecting a vast number of real human feedbacks on low-level vision.
The constructed **Q-Pathway** dataset includes 58K detailed human feedbacks on 18,973 images.
We design a GPT-participated conversion to process these feedbacks into diverse-format 200K instruction-response pairs.
arXiv Detail & Related papers (2023-11-12T09:10:51Z) - Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond [7.760124498553333]
We study whether vision-language models execute vision and language tasks consistently or independently.
We introduce a systematic framework that quantifies the capability disparities between different modalities in the multi-modal setting.
We introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
arXiv Detail & Related papers (2023-10-19T06:45:11Z) - MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts [170.01089233942594]
MathVista is a benchmark designed to combine challenges from diverse mathematical and visual tasks.
The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%.
GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning.
arXiv Detail & Related papers (2023-10-03T17:57:24Z) - Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? [50.29862466940209]
We introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions.
We analyze various pre-trained visual question answering models and gain insights into their characteristics.
We show that accurate visual entity recognition can be used to improve performance on InfoSeek by retrieving relevant documents.
arXiv Detail & Related papers (2023-02-23T00:33:54Z)
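As referenced in the ENEM entry above, the sketch below contrasts the two prompting strategies compared in that study: answering directly from the question image versus first transcribing the image into a text caption and answering from the caption alone. Model names, prompt wording, and function names are assumptions for illustration, not the study's actual setup.

```python
# Sketch of two prompting strategies for visually grounded exam questions:
# (a) send the question image directly to a vision model, or
# (b) transcribe the image into text first, then answer from the caption.
# Model names and prompts are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def ask_with_image(image_data_url: str, question: str) -> str:
    """Strategy (a): the vision model sees the raw image."""
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_data_url}},
        ]}],
    )
    return r.choices[0].message.content

def ask_with_caption(image_data_url: str, question: str) -> str:
    """Strategy (b): transcribe the image to text, then answer from text only."""
    caption = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Transcribe all text, tables, and figures "
                                     "in this image as plain text."},
            {"type": "image_url", "image_url": {"url": image_data_url}},
        ]}],
    ).choices[0].message.content
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Context:\n{caption}\n\nQuestion:\n{question}"}],
    )
    return r.choices[0].message.content
```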