Evaluating GPT-4's Vision Capabilities on Brazilian University Admission
Exams
- URL: http://arxiv.org/abs/2311.14169v1
- Date: Thu, 23 Nov 2023 19:20:59 GMT
- Title: Evaluating GPT-4's Vision Capabilities on Brazilian University Admission
Exams
- Authors: Ramon Pires, Thales Sales Almeida, Hugo Abonizio, Rodrigo Nogueira
- Abstract summary: We present a framework to evaluate language models on entrance exams, which incorporates both textual and visual elements.
We evaluate the two most recent editions of Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities.
One of the highlights is that text captions transcribing visual content outperform the direct use of images, suggesting that the vision model has room for improvement.
- Score: 14.801853435122908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in language models have showcased human-comparable
performance in academic entrance exams. However, existing studies often
overlook questions that require the integration of visual comprehension, thus
compromising the full spectrum and complexity inherent in real-world scenarios.
To address this gap, we present a comprehensive framework to evaluate language
models on entrance exams, which incorporates both textual and visual elements.
We evaluate the two most recent editions of Exame Nacional do Ensino Médio
(ENEM), the main standardized entrance examination adopted by Brazilian
universities. Our study not only reaffirms the capabilities of GPT-4 as the
state of the art for handling complex multidisciplinary questions, but also
pioneers a realistic assessment of multimodal language models on
Portuguese examinations. One of the highlights is that text captions
transcribing visual content outperform the direct use of images, suggesting
that the vision model has room for improvement. Yet, despite improvements
afforded by images or captions, mathematical questions remain a challenge for
these state-of-the-art models. The code and data used in the experiments are
available at https://github.com/piresramon/gpt-4-enem.
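The caption-versus-image comparison reported above amounts to asking the same multiple-choice question twice: once with the raw figure attached, and once with a textual transcription in its place. Below is a minimal sketch of how such a comparison could be run against the OpenAI chat API; it is not the authors' code, and the model name, prompt wording, and placeholder paths are assumptions.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = "..."            # question statement (placeholder)
CHOICES = "A) ...\nB) ...\nC) ...\nD) ...\nE) ..."
CAPTION = "..."             # textual transcription of the figure (placeholder)
IMAGE_PATH = "figure.png"   # hypothetical path to the question's figure

def ask(content):
    """Send one multiple-choice question and return the model's raw answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model; the paper evaluated GPT-4 with vision
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

# Condition 1: pass the figure directly as an image.
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
answer_with_image = ask([
    {"type": "text", "text": f"{QUESTION}\n{CHOICES}\nAnswer with a single letter."},
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
])

# Condition 2: replace the figure with its text caption.
answer_with_caption = ask(
    f"{QUESTION}\nFigure description: {CAPTION}\n{CHOICES}\nAnswer with a single letter."
)

print(answer_with_image, answer_with_caption)

Accuracy under each condition is then the fraction of questions whose returned letter matches the official answer key, which is how the two input formats can be compared.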
Related papers
- Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency [3.161954199291541]
This research study comprehensively evaluates the language, vision, speech, and multimodal capabilities of GPT-4o.
GPT-4o demonstrates high accuracy and efficiency across multiple domains in language and reasoning capabilities.
The model shows variability and faces limitations in handling complex and ambiguous inputs.
arXiv Detail & Related papers (2024-06-19T19:00:21Z) - Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam [0.0]
This study investigates the performance of ChatGPT-4 Vision, OpenAI's most advanced visual model.
By presenting the model with the exam's open and multiple-choice questions in their original image format, we were able to evaluate the model's reasoning and self-reflecting capabilities.
ChatGPT-4 Vision significantly outperformed the average exam participant, positioning itself within the top 10 best score percentile.
arXiv Detail & Related papers (2024-06-14T02:42:30Z) - ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z) - Holistic Analysis of Hallucination in GPT-4V(ision): Bias and
Interference Challenges [54.42256219010956]
This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference.
Bias refers to the model's tendency to hallucinate certain types of responses, possibly due to an imbalance in its training data.
Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted by how the text prompt is phrased or how the input image is presented.
arXiv Detail & Related papers (2023-11-06T17:26:59Z) - GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-image-to-text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z) - Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task
Instruction Tuning [27.544311403607786]
We introduce the Ziya-Visual series, a set of bilingual large-scale vision-language models (LVLMs).
Our models adopt the Querying Transformer from BLIP-2, further exploring the assistance of optimization schemes.
In addition, we stimulate the understanding ability of GPT-4 in multi-modal scenarios, translating our gathered English image-text datasets into Chinese.
arXiv Detail & Related papers (2023-10-12T09:39:17Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT-4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - Probing Contextual Language Models for Common Ground with Visual
Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.