ChatGPT for automated grading of short answer questions in mechanical ventilation
- URL: http://arxiv.org/abs/2505.04645v1
- Date: Mon, 05 May 2025 19:04:25 GMT
- Title: ChatGPT for automated grading of short answer questions in mechanical ventilation
- Authors: Tejas Jade, Alex Yartsev
- Abstract summary: Large language models (LLMs) simulate conversational language and interpret unstructured free-text responses. We evaluated ChatGPT 4o to grade SAQs in a postgraduate medical setting using data from 215 students.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Standardised tests using short answer questions (SAQs) are common in postgraduate education. Large language models (LLMs) simulate conversational language and interpret unstructured free-text responses in ways that align with the application of SAQ grading rubrics, making them attractive for automated grading. We evaluated ChatGPT 4o to grade SAQs in a postgraduate medical setting using data from 215 students (557 short-answer responses) enrolled in an online course on mechanical ventilation (2020--2024). Deidentified responses to three case-based scenarios were presented to ChatGPT with a standardised grading prompt and rubric. Outputs were analysed using mixed-effects modelling, variance component analysis, intraclass correlation coefficients (ICCs), Cohen's kappa, Kendall's W, and Bland--Altman statistics. ChatGPT awarded systematically lower marks than human graders, with a mean difference (bias) of -1.34 on a 10-point scale. ICC values indicated poor individual-level agreement (ICC1 = 0.086), and Cohen's kappa (-0.0786) suggested no meaningful agreement. Variance component analysis showed minimal variability among the five ChatGPT sessions (G-value = 0.87), indicating internal consistency but divergence from the human grader. The poorest agreement was observed for evaluative and analytic items, whereas checklist and prescriptive rubric items had less disagreement. We caution against the use of LLMs in grading postgraduate coursework. Over 60% of ChatGPT-assigned grades differed from human grades by more than acceptable boundaries for high-stakes assessments.
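As an illustration of the agreement statistics named in the abstract, the sketch below computes the Bland--Altman bias and 95% limits of agreement, an unweighted Cohen's kappa on rounded marks, and ICC(1) from a one-way ANOVA decomposition. It is not the authors' analysis code: the paired scores are synthetic and the names `human` and `chatgpt` are illustrative assumptions; the mixed-effects modelling, Kendall's W, and the generalizability (G) coefficient reported in the paper are omitted.

```python
# Minimal sketch (not the authors' code) of three agreement statistics on
# hypothetical paired human/ChatGPT marks scored out of 10.
import numpy as np

rng = np.random.default_rng(0)
human = rng.integers(4, 11, size=50).astype(float)                  # synthetic human marks
chatgpt = np.clip(human - 1.3 + rng.normal(0, 1.5, size=50), 0, 10)  # synthetic LLM marks

# Bland-Altman: mean difference (bias) and 95% limits of agreement
diff = chatgpt - human
bias = diff.mean()
sd = diff.std(ddof=1)
loa = (bias - 1.96 * sd, bias + 1.96 * sd)

# Unweighted Cohen's kappa on integer (rounded) marks
a, b = np.rint(chatgpt).astype(int), np.rint(human).astype(int)
cats = np.union1d(a, b)
table = np.array([[np.sum((a == i) & (b == j)) for j in cats] for i in cats], float)
p_obs = np.trace(table) / table.sum()                                # observed agreement
p_exp = (table.sum(axis=1) @ table.sum(axis=0)) / table.sum() ** 2   # chance agreement
kappa = (p_obs - p_exp) / (1 - p_exp)

# ICC(1): one-way random-effects model with k = 2 raters per response
scores = np.column_stack([human, chatgpt])            # n responses x k raters
n, k = scores.shape
grand = scores.mean()
ms_between = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_within = ((scores - scores.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

print(f"bias={bias:.2f}, LoA=({loa[0]:.2f}, {loa[1]:.2f}), kappa={kappa:.3f}, ICC1={icc1:.3f}")
```

With the real data, the 557 paired marks would replace the synthetic arrays, and a dedicated statistics package could substitute for the manual kappa and ICC computations.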
Related papers
- Assessing the Reliability of Large Language Models for Deductive Qualitative Coding: A Comparative Study of ChatGPT Interventions [0.0]
This study investigates the use of large language models (LLMs) for structured qualitative coding. We classified U.S. Supreme Court case summaries into 21 major policy domains. ChatGPT displayed stable agreement across samples, including high F1 scores in low-support subclasses.
arXiv Detail & Related papers (2025-07-18T22:16:04Z) - Artificial Intelligence Bias on English Language Learners in Automatic Scoring [23.76046619016318]
We fine-tuned BERT with four datasets: responses from (1) ELLs, (2) non-ELLs, (3) a mixed dataset reflecting the real-world proportion of ELLs and non-ELLs, and (4) a balanced mixed dataset with equal representation of both groups. We measured the Mean Score Gaps (MSGs) between ELLs and non-ELLs and then calculated the differences between the MSGs produced by the human and AI models to identify scoring disparities. We found no AI bias or distorted disparities between ELLs and non-ELLs when the training dataset was large.
arXiv Detail & Related papers (2025-05-15T18:32:24Z) - Evaluating the Predictive Capacity of ChatGPT for Academic Peer Review Outcomes Across Multiple Platforms [3.3543455244780223]
This paper introduces two new contexts and employs a more robust method: averaging multiple ChatGPT scores.
We find that averaging 30 ChatGPT predictions, based on reviewer guidelines and using only submitted titles and abstracts, failed to predict peer review outcomes for F1000Research.
arXiv Detail & Related papers (2024-11-14T19:20:33Z) - Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering [5.160473221022088]
This study explores the feasibility of using large language models (LLMs) for automated grading of conceptual questions.
We compared the grading performance of GPT-4o with that of human teaching assistants (TAs) on ten quiz problems from the MEEN 361 course at Texas A&M University.
Our analysis reveals that GPT-4o performs well when grading criteria are straightforward but struggles with nuanced answers.
arXiv Detail & Related papers (2024-11-06T04:41:13Z) - Evaluating ChatGPT as a Question Answering System: A Comprehensive Analysis and Comparison with Existing Models [0.0]
This article scrutinizes ChatGPT as a Question Answering System (QAS).
The primary focus is on evaluating ChatGPT's proficiency in extracting responses from provided paragraphs.
The evaluation highlights hallucinations, where ChatGPT provides responses to questions without available answers in the provided context.
arXiv Detail & Related papers (2023-12-11T08:49:18Z) - Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z) - Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z) - Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z) - Distilling ChatGPT for Explainable Automated Student Answer Assessment [19.604476650824516]
We introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation.
Our experiments show that the proposed method improves the overall QWK score by 11% compared to ChatGPT.
arXiv Detail & Related papers (2023-05-22T12:11:39Z) - Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values [73.82043713141142]
Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values.
We introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command.
arXiv Detail & Related papers (2022-10-14T09:10:49Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs).
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity and overstability causing samples with high accuracies.
arXiv Detail & Related papers (2021-09-24T03:49:38Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)