Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering
- URL: http://arxiv.org/abs/2411.03659v1
- Date: Wed, 06 Nov 2024 04:41:13 GMT
- Title: Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering
- Authors: Rujun Gao, Xiaosu Guo, Xiaodi Li, Arun Balajiee Lekshmi Narayanan, Naveen Thomas, Arun R. Srinivasa
- Abstract summary: This study explores the feasibility of using large language models (LLMs) for automated grading of conceptual questions.
We compared the grading performance of GPT-4o with that of human teaching assistants (TAs) on ten quiz problems from the MEEN 361 course at Texas A&M University.
Our analysis reveals that GPT-4o performs well when grading criteria are straightforward but struggles with nuanced answers.
- Score: 5.160473221022088
- Abstract: This study explores the feasibility of using large language models (LLMs), specifically GPT-4o (ChatGPT), for automated grading of conceptual questions in an undergraduate Mechanical Engineering course. We compared the grading performance of GPT-4o with that of human teaching assistants (TAs) on ten quiz problems from the MEEN 361 course at Texas A&M University, each answered by approximately 225 students. Both the LLM and TAs followed the same instructor-provided rubric to ensure grading consistency. We evaluated performance using Spearman's rank correlation coefficient and Root Mean Square Error (RMSE) to assess the alignment between rankings and the accuracy of scores assigned by GPT-4o and TAs under zero- and few-shot grading settings. In the zero-shot setting, GPT-4o demonstrated a strong correlation with TA grading, with Spearman's rank correlation coefficient exceeding 0.6 in seven out of ten datasets and reaching a high of 0.9387. Our analysis reveals that GPT-4o performs well when grading criteria are straightforward but struggles with nuanced answers, particularly those involving synonyms not present in the rubric. The model also tends to grade more stringently in ambiguous cases compared to human TAs. Overall, ChatGPT shows promise as a tool for grading conceptual questions, offering scalability and consistency.
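As a concrete illustration of the evaluation described in the abstract, the sketch below computes Spearman's rank correlation coefficient and RMSE between LLM-assigned and TA-assigned scores. The score arrays are hypothetical placeholders, not data from the study.

```python
# Minimal sketch of the paper's evaluation metrics: Spearman's rank
# correlation and RMSE between GPT-4o and TA scores for one quiz problem.
# The score arrays below are hypothetical placeholders.
import numpy as np
from scipy.stats import spearmanr

ta_scores = np.array([10, 8, 6, 9, 7, 5])    # hypothetical TA scores
llm_scores = np.array([9, 8, 5, 10, 6, 6])   # hypothetical GPT-4o scores

rho, p_value = spearmanr(ta_scores, llm_scores)         # ranking agreement
rmse = np.sqrt(np.mean((ta_scores - llm_scores) ** 2))  # score accuracy

print(f"Spearman rho = {rho:.4f} (p = {p_value:.3g}), RMSE = {rmse:.3f}")
```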
Related papers
- A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
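A rough sketch of what "run-level" agreement means here, under assumed data: compute each run's effectiveness twice, once from manual judgments and once from LLM-generated (UMBRELA-style) judgments, then correlate the two sets of per-run scores. The numbers and the choice of Kendall's tau are illustrative, not the TREC 2024 RAG Track protocol.

```python
# Illustrative run-level agreement check (not the TREC protocol):
# correlate per-run effectiveness under manual vs. LLM-generated judgments.
from scipy.stats import kendalltau

# Hypothetical effectiveness scores (e.g., mean nDCG) for five runs.
manual_eval = [0.42, 0.55, 0.38, 0.61, 0.47]
llm_eval    = [0.45, 0.53, 0.36, 0.64, 0.49]

tau, p = kendalltau(manual_eval, llm_eval)
print(f"Run-level Kendall tau = {tau:.3f} (p = {p:.3f})")
```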
arXiv Detail & Related papers (2024-11-13T01:12:35Z) - Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.
Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
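A minimal sketch of zero-shot rubric-based grading with the OpenAI Python client; the rubric wording, score range, and output format are assumptions for illustration, not the prompts used in either paper.

```python
# Hypothetical zero-shot grading sketch using the OpenAI Python client.
# The rubric text, score range, and output format are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_response(rubric: str, question: str, answer: str) -> str:
    """Ask GPT-4o to score one student answer against an instructor rubric."""
    prompt = (
        f"Grading rubric:\n{rubric}\n\n"
        f"Question:\n{question}\n\n"
        f"Student answer:\n{answer}\n\n"
        "Assign a score from 0 to 10, following the rubric strictly. "
        "Reply with the numeric score on the first line, then a brief justification."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run variation in grading
    )
    return resp.choices[0].message.content
```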
arXiv Detail & Related papers (2024-11-07T22:51:47Z) - A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course [0.0]
This study evaluates the performance of the ChatGPT variants GPT-3.5 and GPT-4 against work produced solely by students and against a mixed category containing both student and GPT-4 contributions, on university-level physics coding assignments written in Python.
Students averaged 91.9% (SE: 0.4), surpassing the highest-performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference ($p = 2.482 \times 10^{-10}$).
The blinded markers were tasked with guessing the authorship of the submissions on a four-point Likert scale ranging from definitely AI-generated to definitely human-written.
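As a hedged sketch of how such a score comparison might be tested for significance (the paper's exact statistical procedure is not stated in this summary), a two-sample t-test over per-submission marks could look like this; the mark arrays are placeholders.

```python
# Illustrative significance test between two groups of assignment marks.
# The mark arrays are hypothetical; the paper's exact test may differ.
import numpy as np
from scipy.stats import ttest_ind

student_marks = np.array([92.0, 95.5, 88.0, 91.0, 94.0])  # hypothetical
gpt4_marks    = np.array([80.0, 83.5, 79.0, 82.0, 81.5])  # hypothetical

t_stat, p_value = ttest_ind(student_marks, gpt4_marks, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```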
arXiv Detail & Related papers (2024-03-25T17:41:02Z) - Applying Large Language Models and Chain-of-Thought for Automatic
Scoring [23.076596289069506]
This study investigates the application of large language models (LLMs) in the automatic scoring of student-written responses to science assessments.
We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools.
arXiv Detail & Related papers (2023-11-30T21:22:43Z) - Prometheus: Inducing Fine-grained Evaluation Capability in Language
Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
arXiv Detail & Related papers (2023-10-12T16:50:08Z) - Split and Merge: Aligning Position Biases in Large Language Model based
Evaluators [23.38206418382832]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias.
Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested.
It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
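The core consistency check behind position-bias calibration can be sketched as follows: judge each answer pair in both orders and count how often the verdict refers to the same underlying answer. The `judge` function is a hypothetical stand-in for an LLM evaluator, and the answer splitting and merging that PORTIA performs is omitted.

```python
# Sketch of a position-bias consistency check for an LLM-as-judge setup.
# `judge` is a hypothetical stand-in for a real LLM call; PORTIA's
# answer-splitting and merging steps are omitted here.
from typing import Callable, List, Tuple

def consistency_rate(
    judge: Callable[[str, str, str], str],  # returns "first" or "second"
    cases: List[Tuple[str, str, str]],      # (question, answer_a, answer_b)
) -> float:
    """Fraction of cases whose verdict is unchanged when the answer order is swapped."""
    consistent = 0
    for question, ans_a, ans_b in cases:
        verdict_ab = judge(question, ans_a, ans_b)  # A shown first
        verdict_ba = judge(question, ans_b, ans_a)  # B shown first
        # Map positional verdicts back to the underlying answers before comparing.
        winner_ab = ans_a if verdict_ab == "first" else ans_b
        winner_ba = ans_b if verdict_ba == "first" else ans_a
        consistent += int(winner_ab == winner_ba)
    return consistent / len(cases)
```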
arXiv Detail & Related papers (2023-09-29T14:38:58Z) - Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z) - ARB: Advanced Reasoning Benchmark for Large Language Models [94.37521840642141]
We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.
We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks.
arXiv Detail & Related papers (2023-07-25T17:55:19Z) - Progressive-Hint Prompting Improves Reasoning in Large Language Models [63.98629132836499]
This paper proposes a new prompting method named Progressive-Hint Prompting (PHP).
It enables multiple rounds of automatic interaction between users and Large Language Models (LLMs), using previously generated answers as hints to progressively guide the model toward the correct answer.
We conducted extensive and comprehensive experiments on seven benchmarks. The results show that PHP significantly improves accuracy while remaining highly efficient.
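A simplified sketch of the progressive-hint idea, with a hypothetical `ask_llm` function standing in for an actual model call: previous answers are fed back as hints until the answer stabilizes. The prompt wording is illustrative, not the paper's exact template.

```python
# Simplified Progressive-Hint Prompting (PHP) loop. `ask_llm` is a
# hypothetical stand-in for a real LLM call; the hint phrasing is
# illustrative, not the paper's exact template.
from typing import Callable, List, Optional

def progressive_hint_prompting(
    ask_llm: Callable[[str], str],
    question: str,
    max_rounds: int = 5,
) -> Optional[str]:
    hints: List[str] = []
    previous_answer: Optional[str] = None
    for _ in range(max_rounds):
        if hints:
            prompt = f"{question}\n(Hint: the answer is near {', '.join(hints)}.)"
        else:
            prompt = question
        answer = ask_llm(prompt)
        if answer == previous_answer:  # answer stabilized: stop iterating
            return answer
        hints.append(answer)           # feed the new answer back as a hint
        previous_answer = answer
    return previous_answer
```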
arXiv Detail & Related papers (2023-04-19T16:29:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.