Evaluating the Performance of Large Language Models for Spanish Language
in Undergraduate Admissions Exams
- URL: http://arxiv.org/abs/2312.16845v1
- Date: Thu, 28 Dec 2023 06:23:39 GMT
- Title: Evaluating the Performance of Large Language Models for Spanish Language
in Undergraduate Admissions Exams
- Authors: Sabino Miranda, Obdulia Pichardo-Lagunas, Bella Martínez-Seis,
Pierre Baldi
- Abstract summary: This study evaluates the performance of large language models, specifically GPT-3.5 and BARD, in undergraduate admissions exams proposed by the National Polytechnic Institute in Mexico.
Both models demonstrated proficiency, exceeding the minimum acceptance scores for the respective academic programs, reaching up to 75% for some programs.
- Score: 4.974500659156055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study evaluates the performance of large language models, specifically
GPT-3.5 and BARD (powered by the Gemini Pro model), in undergraduate admissions
exams proposed by the National Polytechnic Institute in Mexico. The exams cover
Engineering/Mathematical and Physical Sciences, Biological and Medical
Sciences, and Social and Administrative Sciences. Both models demonstrated
proficiency, exceeding the minimum acceptance scores for the respective academic
programs, reaching up to 75% for some programs. GPT-3.5 outperformed BARD in
Mathematics and Physics, while BARD performed better in History and questions
related to factual information. Overall, GPT-3.5 marginally surpassed BARD with
scores of 60.94% and 60.42%, respectively.
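The evaluation setup described above, prompting a model with multiple-choice exam questions and scoring its answers against the official key, can be sketched roughly as follows. This is a minimal illustration only: the question-file format, prompt wording, helper names, and model identifier are assumptions, not the authors' actual pipeline.

```python
# Rough sketch of scoring an LLM on a multiple-choice admissions exam.
# File format, prompt text, and model name are illustrative assumptions.
import json
import re

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_model(question: str, options: dict[str, str]) -> str:
    """Send one multiple-choice question and return the letter the model picks."""
    option_text = "\n".join(f"{letter}) {text}" for letter, text in options.items())
    prompt = (
        "Answer the following admissions-exam question. "
        "Reply with only the letter of the correct option.\n\n"
        f"{question}\n{option_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip()
    match = re.search(r"[A-D]", answer.upper())  # tolerate verbose replies
    return match.group(0) if match else ""


def score_exam(path: str) -> float:
    """Return the fraction of questions answered correctly (hypothetical JSON layout)."""
    with open(path, encoding="utf-8") as f:
        # Expected layout: [{"question": ..., "options": {"A": ..., ...}, "answer": "B"}, ...]
        exam = json.load(f)
    correct = sum(ask_model(q["question"], q["options"]) == q["answer"] for q in exam)
    return correct / len(exam)


if __name__ == "__main__":
    print(f"Accuracy: {score_exam('exam_questions.json'):.2%}")
```

A per-area accuracy (Mathematics, Physics, History, and so on) follows the same pattern, grouping questions by subject before averaging, which is how area-level comparisons like the ones reported above would be obtained.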
Related papers
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level.
Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics.
Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems.
arXiv Detail & Related papers (2024-10-10T14:39:33Z) - Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants [175.9723801486487]
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions.
GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer across at least one prompting strategy for 85.1% of questions.
Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z) - GPT-4 passes most of the 297 written Polish Board Certification Examinations [0.5461938536945723]
This study evaluated the performance of three Generative Pretrained Transformer (GPT) models on the Polish Board Certification Exam (Państwowy Egzamin Specjalizacyjny, PES) dataset.
Performance varied significantly across the GPT models, which excelled on exams in certain specialties while failing others entirely.
arXiv Detail & Related papers (2024-04-29T09:08:22Z) - OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis of GPT-4V points to prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z) - Fine-tuning ChatGPT for Automatic Scoring [1.4833692070415454]
This study highlights the potential of fine-tuned ChatGPT (GPT3.5) for automatically scoring student written constructed responses.
We compare the performance of fine-tuned GPT-3.5 with that of a fine-tuned version of Google's state-of-the-art language model, BERT.
arXiv Detail & Related papers (2023-10-16T05:09:16Z) - Performance of Large Language Models in a Computer Science Degree
Program [0.5330240017302619]
This paper presents findings on the performance of different large language models in a university of applied sciences' undergraduate computer science degree program.
By prompting the models with lecture material, exercise tasks, and past exams, we aim to evaluate their proficiency across different computer science domains.
We found that ChatGPT-3.5 averaged 79.9% of the total score across the 10 tested modules, BingAI achieved 68.4%, and LLaMa (in the 65-billion-parameter variant) 20%.
arXiv Detail & Related papers (2023-07-24T14:17:00Z) - AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation models in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z) - Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission
Exams [4.2706617195518195]
This study explores the capabilities of Language Models (LMs) in tackling high-stakes multiple-choice tests, using the Brazilian university admission exam as a testbed.
The exam poses challenging tasks for LMs, since its questions may span multiple fields of knowledge.
The best-performing model, GPT-4 with Chain-of-Thought prompts, achieved an accuracy of 87%, largely surpassing GPT-3.5 by 11 points.
arXiv Detail & Related papers (2023-03-29T20:10:13Z) - Large Language Models in the Workplace: A Case Study on Prompt
Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z) - GPT Takes the Bar Exam [0.0]
We document our experimental evaluation of the performance of OpenAI's text-davinci-003 model, often referred to as GPT-3.5.
With the best prompt and parameters, GPT-3.5 achieves a headline correct rate of 50.3% on a complete NCBE MBE practice exam.
arXiv Detail & Related papers (2022-12-29T18:19:43Z) - Reasoning Like Program Executors [48.819113224699976]
POET empowers language models to harvest the reasoning knowledge possessed in program executors via a data-driven approach.
POET can significantly boost model performance on natural language reasoning.
POET opens a new avenue for reasoning-enhanced pre-training.
arXiv Detail & Related papers (2022-01-27T12:28:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.