Assessing GPT Performance in a Proof-Based University-Level Course Under Blind Grading
- URL: http://arxiv.org/abs/2505.13664v1
- Date: Mon, 19 May 2025 19:05:48 GMT
- Title: Assessing GPT Performance in a Proof-Based University-Level Course Under Blind Grading
- Authors: Ming Ding, Rasmus Kyng, Federico Solda, Weixuan Yuan
- Abstract summary: This study assesses the performance of GPT-4o and o1-preview under realistic educational conditions in an undergraduate algorithms course. Results show that GPT-4o consistently struggles, failing to reach the passing threshold, while o1-preview performs significantly better. These findings highlight the need for robust assessment strategies and AI-aware grading policies in education.
- Score: 8.206694431501832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) advance, their role in higher education, particularly in free-response problem-solving, requires careful examination. This study assesses the performance of GPT-4o and o1-preview under realistic educational conditions in an undergraduate algorithms course. Anonymous GPT-generated solutions to take-home exams were graded by teaching assistants unaware of their origin. Our analysis examines both coarse-grained performance (scores) and fine-grained reasoning quality (error patterns). Results show that GPT-4o consistently struggles, failing to reach the passing threshold, while o1-preview performs significantly better, surpassing the passing score and even exceeding the student median in certain exercises. However, both models exhibit issues with unjustified claims and misleading arguments. These findings highlight the need for robust assessment strategies and AI-aware grading policies in education.
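The abstract describes a coarse-grained score analysis: blinded GPT scores compared against the course's passing threshold and the student median. As a minimal illustration of such a comparison, assuming hypothetical per-exam totals (the function name and all numbers below are stand-ins, not the paper's data or code):

```python
import statistics

def coarse_grained_summary(model_scores, student_scores, passing_score):
    """Compare blinded model exam totals to a passing threshold and to the
    student median (illustrative sketch only, not the study's analysis code)."""
    model_median = statistics.median(model_scores)
    student_median = statistics.median(student_scores)
    return {
        "model_median": model_median,
        "reaches_passing_score": model_median >= passing_score,
        "exceeds_student_median": model_median > student_median,
    }

# Illustrative numbers only -- not data from the paper.
print(coarse_grained_summary([41, 56, 62], [50, 58, 64, 70], passing_score=50))
```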
Related papers
- Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment [0.0]
Large language models (LLMs) are now widely accessible, reaching learners at all educational levels. This study compares the problem-solving performance of a general-purpose LLM (GPT-4o) and a reasoning-optimized model (o1-preview) with that of participants of the German Physics Olympiad.
arXiv Detail & Related papers (2025-05-14T14:46:32Z)
- Assessing instructor-AI cooperation for grading essay-type questions in an introductory sociology course [0.0]
We evaluate the performance of generative pre-trained transformer (GPT) models in transcribing and scoring students' responses. For grading, GPT scores showed strong correlations with human grader scores, especially when template answers were provided (an agreement check of this kind is sketched after this entry). This study contributes to the growing literature on AI in education, demonstrating its potential to enhance fairness and efficiency in grading essay-type questions.
arXiv Detail & Related papers (2025-01-11T07:18:12Z)
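The entry above reports strong correlations between GPT and human grader scores. A minimal sketch of such an agreement check, using hypothetical score lists and the standard-library Pearson correlation (this is not the paper's code):

```python
from statistics import correlation  # Pearson's r, Python 3.10+

# Hypothetical scores for the same essays from a human grader and GPT.
human_scores = [8.0, 6.5, 9.0, 5.0, 7.5]
gpt_scores = [7.5, 6.0, 9.5, 5.5, 7.0]

r = correlation(human_scores, gpt_scores)
print(f"Pearson r between human and GPT grades: {r:.2f}")
```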
- Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams. Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
arXiv Detail & Related papers (2024-11-07T22:51:47Z)
- Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants [176.39275404745098]
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions. GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer across at least one prompting strategy for 85.1% of questions. Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z)
- Evaluating Large Language Models on the GMAT: Implications for the Future of Business Education [0.13654846342364302]
This study introduces the first benchmark to assess the performance of seven major Large Language Models (LLMs) on the Graduate Management Admission Test (GMAT).
Our analysis shows that most LLMs outperform human candidates, with GPT-4 Turbo not only outperforming the other models but also surpassing the average scores of graduate students at top business schools.
While AI's promise in education, assessment, and tutoring is clear, challenges remain.
arXiv Detail & Related papers (2024-01-02T03:54:50Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts [21.150221839202878]
Large Language Models (LLMs) have achieved significant success across various general tasks.
In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science.
We compare both human and GPT-based evaluation scores and provide in-depth analysis.
arXiv Detail & Related papers (2023-08-21T01:32:45Z)
- ARB: Advanced Reasoning Benchmark for Large Language Models [94.37521840642141]
We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.
We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks.
arXiv Detail & Related papers (2023-07-25T17:55:19Z)
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation models in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z)
- Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes.
We propose a greedy search strategy to identify a near-optimal prompt that improves the performance of in-context learning (see the sketch after this entry).
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
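The last entry proposes a greedy search for a near-optimal in-context prompt, guided by a predictive-bias metric. Below is a minimal sketch of that idea; the function names and the label-imbalance metric are stand-ins for illustration, not the paper's implementation (the paper's metric is computed from model predictions rather than surface labels).

```python
from typing import Callable, List

def greedy_prompt_search(candidates: List[str],
                         bias_of: Callable[[List[str]], float],
                         max_examples: int) -> List[str]:
    """Greedily assemble in-context demonstrations: at each step, add the
    candidate that yields the lowest bias metric, stopping early when no
    remaining candidate improves it (sketch, not the paper's algorithm)."""
    chosen: List[str] = []
    pool = list(candidates)
    while pool and len(chosen) < max_examples:
        best = min(pool, key=lambda ex: bias_of(chosen + [ex]))
        if chosen and bias_of(chosen + [best]) >= bias_of(chosen):
            break  # no remaining candidate lowers the metric
        chosen.append(best)
        pool.remove(best)
    return chosen

# Stand-in bias metric: label imbalance among the chosen demonstrations.
def label_imbalance(demos: List[str]) -> float:
    yes = sum(d.endswith("yes") for d in demos)
    return abs(yes - (len(demos) - yes))

examples = ["Q1 -> yes", "Q2 -> no", "Q3 -> yes", "Q4 -> no"]
print(greedy_prompt_search(examples, label_imbalance, max_examples=2))
```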
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.