A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization
- URL: http://arxiv.org/abs/2410.02165v1
- Date: Thu, 3 Oct 2024 03:11:24 GMT
- Title: A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization
- Authors: Yucheng Chu, Hang Li, Kaiqi Yang, Harry Shomer, Hui Liu, Yasemin Copur-Gencturk, Jiliang Tang
- Abstract summary: Open-ended short-answer questions (SAGs) have been widely recognized as a powerful tool for providing deeper insights into learners' responses in the context of learning analytics (LA).
SAGs often present challenges in practice due to the high grading workload and concerns about inconsistent assessments.
We propose a unified multi-agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAGs.
- Score: 31.722907135361492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-ended short-answer questions (SAGs) have been widely recognized as a powerful tool for providing deeper insights into learners' responses in the context of learning analytics (LA). However, SAGs often present challenges in practice due to the high grading workload and concerns about inconsistent assessments. With recent advancements in natural language processing (NLP), automatic short-answer grading (ASAG) offers a promising solution to these challenges. Despite this, current ASAG algorithms are often limited in generalizability and tend to be tailored to specific questions. In this paper, we propose a unified multi-agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAGs. More importantly, GradeOpt incorporates two additional LLM-based agents - the reflector and the refiner - into the multi-agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self-reflection on its errors. Through experiments on a challenging ASAG task, namely the grading of pedagogical content knowledge (PCK) and content knowledge (CK) questions, GradeOpt demonstrates superior performance in grading accuracy and behavior alignment with human graders compared to representative baselines. Finally, comprehensive ablation studies confirm the effectiveness of the individual components designed in GradeOpt.
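The grader-reflector-refiner loop described in the abstract can be pictured roughly as follows. This is a minimal sketch, not the paper's implementation: `call_llm`, all prompts, and the batch protocol are placeholders assumed for illustration.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError

def grade(guidelines: str, question: str, answer: str) -> str:
    """Grader agent: score one answer against the current guidelines."""
    return call_llm(f"Guidelines:\n{guidelines}\n\n"
                    f"Question: {question}\nAnswer: {answer}\nScore:")

def optimize_guidelines(guidelines, labeled_batch, max_rounds=3):
    """Refine grading guidelines by self-reflecting on grading errors."""
    for _ in range(max_rounds):
        errors = [(q, a, human, pred)
                  for q, a, human in labeled_batch
                  if (pred := grade(guidelines, q, a)) != human]
        if not errors:
            break  # current guidelines already reproduce the human grades
        # Reflector agent: diagnose why the grader diverged from humans.
        reflection = call_llm(
            "Explain why these gradings disagree with the human labels:\n" +
            "\n".join(f"Q: {q} | A: {a} | human: {h} | model: {p}"
                      for q, a, h, p in errors))
        # Refiner agent: rewrite the guidelines using the reflection.
        guidelines = call_llm(
            f"Current guidelines:\n{guidelines}\n\nReflection:\n{reflection}\n"
            "Rewrite the guidelines to prevent these errors:")
    return guidelines
```

The notable design choice is that the optimization target is the grading guidelines themselves, so an improved guideline text transfers to new answers without retraining any model.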
Related papers
- Language Models are Few-Shot Graders [0.12289361708127876]
We present an ASAG pipeline leveraging state-of-the-art LLMs.
We compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview.
Our findings indicate that providing graded examples enhances grading accuracy, with RAG-based selection outperforming random selection.
arXiv Detail & Related papers (2025-02-18T23:38:21Z)
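A minimal sketch of the RAG-based selection of graded examples that this entry reports outperforming random selection. `embed` stands in for any sentence-embedding model, and the prompt format is invented for illustration.

```python
import math
import random

def embed(text: str) -> list[float]:
    """Placeholder for a sentence-embedding model."""
    raise NotImplementedError

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm  # assumes non-zero vectors

def select_examples(new_answer, graded_pool, k=3, use_rag=True):
    """Pick k graded (answer, score) pairs to show the LLM as few-shot examples."""
    if not use_rag:
        return random.sample(graded_pool, k)  # random-selection baseline
    query = embed(new_answer)
    return sorted(graded_pool,
                  key=lambda ex: cosine(embed(ex[0]), query),
                  reverse=True)[:k]

def build_prompt(new_answer, examples):
    shots = "\n".join(f"Answer: {a}\nScore: {s}" for a, s in examples)
    return f"{shots}\nAnswer: {new_answer}\nScore:"
```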
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [89.97082652805904]
We propose QLASS (Q-guided Language Agent Stepwise Search) to automatically generate annotations by estimating Q-values.
With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value.
We empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis.
arXiv Detail & Related papers (2025-02-04T18:58:31Z)
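The Q-guided stepwise search in the QLASS entry above might look roughly like this: candidate actions proposed by the language agent are re-ranked by a learned Q-value estimator at every step. Both models are assumed placeholders, not the paper's components.

```python
def propose_actions(state: str, n: int = 4) -> list[str]:
    """Placeholder: sample n candidate next actions from the language agent."""
    raise NotImplementedError

def q_value(state: str, action: str) -> float:
    """Placeholder: learned estimate of long-term value for (state, action)."""
    raise NotImplementedError

def q_guided_step(state: str) -> str:
    """Choose the candidate action with the highest estimated Q-value."""
    return max(propose_actions(state), key=lambda a: q_value(state, a))

def rollout(state, transition, is_done, max_steps=20):
    """Greedy Q-guided rollout; `transition` applies an action to a state."""
    for _ in range(max_steps):
        if is_done(state):
            break
        state = transition(state, q_guided_step(state))
    return state
```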
- SteLLA: A Structured Grading System Using LLMs with RAG [2.630522349105014]
We present SteLLA (Structured Grading System Using LLMs with RAG), in which Retrieval-Augmented Generation (RAG) is used to empower LLMs on the ASAG task.
A real-world dataset containing students' exam answers was collected from a college-level Biology course.
Experiments show that our proposed system achieves substantial agreement with the human grader while providing breakdown grades and feedback on all the knowledge points examined in the problem.
arXiv Detail & Related papers (2025-01-15T19:24:48Z)
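The structured, per-knowledge-point grading that SteLLA's summary describes could be organized along these lines; every function here is a hypothetical placeholder, and the plain-average aggregation is an assumption, not the paper's formula.

```python
def retrieve_reference(knowledge_point: str) -> str:
    """Placeholder for RAG retrieval over course material or reference answers."""
    raise NotImplementedError

def grade_point(answer: str, knowledge_point: str, reference: str) -> tuple[float, str]:
    """Placeholder LLM call returning (score in [0, 1], feedback text)."""
    raise NotImplementedError

def structured_grade(answer: str, knowledge_points: list[str]):
    """Grade each knowledge point separately, then aggregate."""
    breakdown = {kp: grade_point(answer, kp, retrieve_reference(kp))
                 for kp in knowledge_points}
    total = sum(score for score, _ in breakdown.values()) / len(breakdown)
    return total, breakdown  # overall grade plus per-point grades and feedback
```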
- ASAG2024: A Combined Benchmark for Short Answer Grading [0.10826342457160269]
Short Answer Grading (SAG) systems aim to automatically score students' answers.
However, no comprehensive short-answer grading benchmark exists across different subjects, grading scales, and distributions.
We introduce the combined ASAG2024 benchmark to facilitate the comparison of automated grading systems.
arXiv Detail & Related papers (2024-09-27T09:56:02Z)
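One concrete problem a combined benchmark like ASAG2024 must solve is reconciling heterogeneous grading scales. The mapping below is an assumed example of such normalization, not the benchmark's actual procedure.

```python
# Assumed letter-grade mapping for illustration only.
LETTER = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}

def normalize(score, scale):
    """Map a raw score on a known scale to the common range [0, 1]."""
    if scale == "letter":
        return LETTER[score]
    lo, hi = scale  # numeric scales given as a (min, max) pair
    return (score - lo) / (hi - lo)

assert normalize(4, (0, 5)) == 0.8
assert normalize("B", "letter") == 0.75
```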
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains: Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs [70.54226917774933]
We propose the Decomposition-Alignment-Reasoning Agent (DARA) framework.
DARA effectively parses questions into formal queries through a dual mechanism.
We show that DARA attains performance comparable to state-of-the-art enumerating-and-ranking-based methods for KGQA.
arXiv Detail & Related papers (2024-06-11T09:09:37Z)
- Grade Like a Human: Rethinking Automated Assessment with Large Language Models [11.442433408767583]
Large language models (LLMs) have been used for automated grading, but they have not yet achieved the same level of performance as humans.
We propose an LLM-based grading system that addresses the entire grading procedure, not just score prediction.
arXiv Detail & Related papers (2024-05-30T05:08:15Z)
- Automated Long Answer Grading with RiceChem Dataset [19.34390869143846]
We introduce a new area of study in the field of educational Natural Language Processing: Automated Long Answer Grading (ALAG).
ALAG presents unique challenges due to the complexity and multifaceted nature of fact-based long answers.
We propose a novel approach to ALAG by formulating it as a rubric entailment problem, employing natural language inference models.
arXiv Detail & Related papers (2024-04-22T16:28:09Z)
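The rubric-entailment formulation above reduces long-answer grading to one NLI call per rubric criterion, roughly as sketched here; `nli` and the example rubric are illustrative assumptions.

```python
def nli(premise: str, hypothesis: str) -> str:
    """Placeholder NLI model: returns 'entailment', 'neutral', or 'contradiction'."""
    raise NotImplementedError

def rubric_grade(answer: str, rubric: list[tuple[str, float]]) -> float:
    """Award a criterion's points when the student answer entails it."""
    return sum(points
               for criterion, points in rubric
               if nli(premise=answer, hypothesis=criterion) == "entailment")

# Illustrative rubric for a biology-style long answer:
rubric = [("The answer states that enzymes lower activation energy.", 2.0),
          ("The answer notes that enzymes are not consumed by the reaction.", 1.0)]
```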
- AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents.
AgentBoard offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit.
This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
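AgentBoard's fine-grained progress rate can be illustrated with a toy computation: credit the fraction of annotated subgoals completed rather than binary task success. The exact metric definition in the paper may differ; this is an assumed simplification.

```python
def progress_rate(completed: int, total: int) -> float:
    """Fraction of a task's annotated subgoals the agent completed."""
    return completed / total

def benchmark_score(runs: list[tuple[int, int]]) -> float:
    """Mean progress rate over (completed, total) pairs, one per task."""
    return sum(progress_rate(c, t) for c, t in runs) / len(runs)

# Two tasks at 3/4 and 1/2 subgoals give a mean progress rate of 0.625,
# even though neither task counts as a full success.
assert benchmark_score([(3, 4), (1, 2)]) == 0.625
```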
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG).
Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection.
It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
arXiv Detail & Related papers (2023-10-17T18:18:32Z)
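A rough sketch of a Self-RAG-style loop: retrieve only when the model judges it necessary, then critique the draft and regenerate with evidence if it looks unsupported. All four model calls are placeholders, and the paper's special reflection tokens are not modeled.

```python
def needs_retrieval(prompt: str) -> bool:
    """Placeholder: the model's own 'should I retrieve?' decision."""
    raise NotImplementedError

def retrieve(prompt: str) -> str:
    """Placeholder retriever returning evidence passages."""
    raise NotImplementedError

def generate(prompt: str, evidence: str = "") -> str:
    """Placeholder generator, optionally conditioned on evidence."""
    raise NotImplementedError

def supported(draft: str, evidence: str) -> bool:
    """Placeholder self-critique: is the draft supported and useful?"""
    raise NotImplementedError

def self_rag(prompt: str, max_tries: int = 2) -> str:
    evidence = retrieve(prompt) if needs_retrieval(prompt) else ""
    draft = generate(prompt, evidence)
    for _ in range(max_tries):
        if supported(draft, evidence):
            break
        evidence = retrieve(prompt)   # fetch evidence and try again
        draft = generate(prompt, evidence)
    return draft
```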
- Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning [56.26889258704261]
We propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA).
SAMA prompts pretrained language models with chain-of-thought to suggest potential goals, provide suitable goal decomposition and subgoal allocation, and perform self-reflection-based replanning.
SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods.
arXiv Detail & Related papers (2023-05-18T10:37:54Z)
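Finally, the LM-driven decomposition and replanning in the SAMA entry could be orchestrated along these lines; the prompts and `call_llm` are assumptions for illustration, not the paper's implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a pretrained language model queried with chain-of-thought."""
    raise NotImplementedError

def plan(task: str, agents: list[str]) -> str:
    """Ask the LM to propose a goal, decompose it, and allocate subgoals."""
    return call_llm(f"Task: {task}\nAgents: {', '.join(agents)}\n"
                    "Think step by step: propose a goal, decompose it into "
                    "subgoals, and assign one subgoal to each agent.")

def replan(task: str, agents: list[str], failure_trace: str) -> str:
    """Self-reflection-based replanning after a failed episode."""
    reflection = call_llm(f"The plan failed. Trace:\n{failure_trace}\n"
                          "Explain what went wrong.")
    return call_llm(f"Task: {task}\nAgents: {', '.join(agents)}\n"
                    f"Reflection: {reflection}\nPropose a revised plan.")
```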