Related papers: Grade Like a Human: Rethinking Automated Assessment with Large Language Models

URL: http://arxiv.org/abs/2405.19694v1
Date: Thu, 30 May 2024 05:08:15 GMT
Title: Grade Like a Human: Rethinking Automated Assessment with Large Language Models
Authors: Wenjing Xie, Juxin Niu, Chun Jason Xue, Nan Guan
Abstract summary: Large language models (LLMs) have been used for automated grading, but they have not yet achieved the same level of performance as humans. We propose an LLM-based grading system that addresses the entire grading procedure, including the following key components.
Score: 11.442433408767583
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading process. In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs.

Related papers

Ensemble ToT of LLMs and Its Application to Automatic Grading System for Supporting Self-Learning [0.8490659704051299]
Ensemble Tree-of-Thought (ToT) is a framework that enhances LLM outputs by integrating multiple models. Our grading system first evaluates the grading tendencies of LLMs, then generates multiple results, and finally integrates them via a simulated debate.
arXiv Detail & Related papers (2025-02-23T01:17:46Z)
Language Models are Few-Shot Graders [0.12289361708127876]
We present an ASAG pipeline leveraging state-of-the-art LLMs. We compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview. Our findings indicate that providing graded examples enhances grading accuracy, with RAG-based selection outperforming random selection.
arXiv Detail & Related papers (2025-02-18T23:38:21Z)
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets. The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed. We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization [31.722907135361492]
Open-ended short-answer questions (SAGs) have been widely recognized as a powerful tool for providing deeper insights into learners' responses in the context of learning analytics (LA). SAGs often present challenges in practice due to the high grading workload and concerns about inconsistent assessments. We propose a unified multi-agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAGs.
arXiv Detail & Related papers (2024-10-03T03:11:24Z)
Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored. We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
ASAG2024: A Combined Benchmark for Short Answer Grading [0.10826342457160269]
Short Answer Grading (SAG) systems aim to automatically score students' answers. There exists no comprehensive short-answer grading benchmark across different subjects, grading scales, and distributions. We introduce the combined ASAG2024 benchmark to facilitate the comparison of automated grading systems.
arXiv Detail & Related papers (2024-09-27T09:56:02Z)
"I understand why I got this grade": Automatic Short Answer Grading with Feedback [36.74896284581596]
We present a dataset of 5.8k student answers accompanied by reference answers and questions for the Automatic Short Answer Grading (ASAG) task. The EngSAF dataset is meticulously curated to cover a diverse range of subjects, questions, and answer patterns from multiple engineering domains.
arXiv Detail & Related papers (2024-06-30T15:42:18Z)
Towards LLM-based Autograding for Short Textual Answers [4.853810201626855]
This manuscript is an evaluation of a large language model for the purpose of autograding. Our findings suggest that while "out-of-the-box" LLMs provide a valuable tool, their readiness for independent automated grading remains a work in progress.
arXiv Detail & Related papers (2023-09-09T22:25:56Z)
Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs). We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing. We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation, and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
Automated grading workflows for providing personalized feedback to open-ended data science assignments [1.534667887016089]
In this paper, we discuss the steps of a typical grading workflow and highlight which steps can be automated in an approach that we call automated grading workflow. We illustrate how gradetools, a new R package, implements this approach within RStudio to facilitate efficient and consistent grading while providing individualized feedback.
arXiv Detail & Related papers (2023-08-18T01:22:11Z)
ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback [54.142719510638614]
In this paper, we frame the problem of providing feedback as few-shot classification. A meta-learner adapts to give feedback to student code on a new programming question from just a few examples by instructors. Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university.
arXiv Detail & Related papers (2021-07-23T22:41:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.