Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission
Exams
- URL: http://arxiv.org/abs/2303.17003v1
- Date: Wed, 29 Mar 2023 20:10:13 GMT
- Title: Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission
Exams
- Authors: Desnes Nunes, Ricardo Primi, Ramon Pires, Roberto Lotufo, and Rodrigo
Nogueira
- Abstract summary: This study aims to explore the capabilities of Language Models (LMs) in tackling high-stakes multiple-choice tests.
This exam poses challenging tasks for LMs, since its questions may span multiple fields of knowledge.
The best-performing model, GPT-4 with Chain-of-Thought prompts, achieved an accuracy of 87%, largely surpassing GPT-3.5 by 11 points.
- Score: 4.2706617195518195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The present study aims to explore the capabilities of Language Models (LMs)
in tackling high-stakes multiple-choice tests, represented here by the Exame
Nacional do Ensino Médio (ENEM), a multidisciplinary entrance examination
widely adopted by Brazilian universities. This exam poses challenging tasks for
LMs, since its questions may span multiple fields of knowledge, requiring
understanding of information from diverse domains. For instance, a question may
require comprehension of both statistics and biology to be solved. This work
analyzed responses generated by GPT-3.5 and GPT-4 models for questions
presented in the 2009-2017 exams, as well as for questions of the 2022 exam,
which were made public after the training of the models was completed.
Furthermore, different prompt strategies were tested, including the use of
Chain-of-Thought (CoT) prompts to generate explanations for answers. On the
2022 edition, the best-performing model, GPT-4 with CoT, achieved an accuracy
of 87%, largely surpassing GPT-3.5 by 11 points. The code and data used in the
experiments are available at https://github.com/piresramon/gpt-4-enem.
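As a rough illustration of the Chain-of-Thought strategy described above, the sketch below queries a GPT model on an ENEM-style multiple-choice question and asks it to reason before committing to an alternative. This is only an illustrative sketch, not the authors' implementation: it assumes the OpenAI Python client (version 1.x), an API key in the environment, and a hypothetical placeholder question; the actual prompts, few-shot examples, and evaluation code are in the repository linked above.

```python
# Minimal CoT prompting sketch for an ENEM-style multiple-choice question.
# NOT the authors' code; see https://github.com/piresramon/gpt-4-enem for the
# real prompts and evaluation pipeline. Assumes openai>=1.0 and OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

# Hypothetical placeholder question with the usual five ENEM alternatives.
question = (
    "Questão (exemplo hipotético): ...\n"
    "A) ...\nB) ...\nC) ...\nD) ...\nE) ..."
)

# CoT instruction: explain the reasoning first, then report the final letter.
messages = [
    {"role": "system", "content": "You answer Brazilian ENEM multiple-choice questions."},
    {"role": "user", "content": (
        f"{question}\n\n"
        "Think step by step, explaining the reasoning needed to solve the "
        "question, and finish with a line in the form 'Answer: <letter>'."
    )},
]

response = client.chat.completions.create(
    model="gpt-4",      # or "gpt-3.5-turbo" for the baseline comparison
    messages=messages,
    temperature=0,      # deterministic decoding for a reproducible evaluation
)

reply = response.choices[0].message.content
# Naive (assumed) extraction of the predicted alternative from the reply.
predicted = reply.rsplit("Answer:", 1)[-1].strip()[:1] if "Answer:" in reply else None
print(reply)
print("Predicted alternative:", predicted)
```

Accuracy over an exam edition is then simply the fraction of questions whose predicted letter matches the official answer key.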
Related papers
- Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants [175.9723801486487]
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions.
GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer across at least one prompting strategy for 85.1% of questions.
Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z)
- Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions [2.0411082897313984]
This study investigates how LLMs, specifically GPT-3.5 and GPT-4, can develop tailored questions for Grade 9 math.
By utilizing an iterative method, these models adjust questions based on difficulty and content, responding to feedback from a simulated 'student' model.
arXiv Detail & Related papers (2024-06-20T00:25:43Z)
- OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z)
- Evaluating GPT-4's Vision Capabilities on Brazilian University Admission Exams [14.801853435122908]
We present a framework to evaluate language models on entrance exams, which incorporates both textual and visual elements.
We evaluate the two most recent editions of Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities.
One of the highlights is that text captions transcribing visual content outperform the direct use of images, suggesting that the vision model has room for improvement.
arXiv Detail & Related papers (2023-11-23T19:20:59Z)
- Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts [21.150221839202878]
Large Language Models (LLMs) have achieved significant success across various general tasks.
In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science.
We compare both human and GPT-based evaluation scores and provide in-depth analysis.
arXiv Detail & Related papers (2023-08-21T01:32:45Z)
- ARB: Advanced Reasoning Benchmark for Large Language Models [94.37521840642141]
We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.
We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks.
arXiv Detail & Related papers (2023-07-25T17:55:19Z)
- How is ChatGPT's behavior changing over time? [72.79311931941876]
We evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4.
We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time.
arXiv Detail & Related papers (2023-07-18T06:56:08Z)
- Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models [21.86774454216937]
We evaluate the ability of large language models to fulfill the graduation requirements for any MIT major in Mathematics and EECS.
Our results demonstrate that GPT-3.5 successfully solves a third of the entire MIT curriculum, while GPT-4, with prompt engineering, achieves a perfect solve rate on a test set excluding questions based on images.
arXiv Detail & Related papers (2023-06-15T09:48:14Z)
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models [76.88692952308084]
M3Exam is a benchmark for evaluating large language models (LLMs) in a multilingual, multimodal, and multilevel context.
M3Exam contains 12,317 questions in 9 diverse languages with three educational levels.
We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text.
arXiv Detail & Related papers (2023-06-08T13:21:29Z)
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation models in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z)
- GPT Takes the Bar Exam [0.0]
We document our experimental evaluation of the performance of OpenAI's text-davinci-003 model, often referred to as GPT-3.5.
With the best prompt and parameters, GPT-3.5 achieves a headline correct rate of 50.3% on a complete NCBE MBE practice exam.
arXiv Detail & Related papers (2022-12-29T18:19:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.