Related papers: A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization

A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization

URL: http://arxiv.org/abs/2410.02165v1
Date: Thu, 3 Oct 2024 03:11:24 GMT
Title: A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization
Authors: Yucheng Chu, Hang Li, Kaiqi Yang, Harry Shomer, Hui Liu, Yasemin Copur-Gencturk, Jiliang Tang,
Abstract summary: Open-ended short-answer questions (SAGs) have been widely recognized as a powerful tool for providing deeper insights into learners' responses in the context of learning analytics (LA) SAGs often present challenges in practice due to the high grading workload and concerns about inconsistent assessments. We propose a unified multi-agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAGs.
Score: 31.722907135361492
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open-ended short-answer questions (SAGs) have been widely recognized as a powerful tool for providing deeper insights into learners' responses in the context of learning analytics (LA). However, SAGs often present challenges in practice due to the high grading workload and concerns about inconsistent assessments. With recent advancements in natural language processing (NLP), automatic short-answer grading (ASAG) offers a promising solution to these challenges. Despite this, current ASAG algorithms are often limited in generalizability and tend to be tailored to specific questions. In this paper, we propose a unified multi-agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAGs. More importantly, GradeOpt incorporates two additional LLM-based agents - the reflector and the refiner - into the multi-agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self-reflection on its errors. Through experiments on a challenging ASAG task, namely the grading of pedagogical content knowledge (PCK) and content knowledge (CK) questions, GradeOpt demonstrates superior performance in grading accuracy and behavior alignment with human graders compared to representative baselines. Finally, comprehensive ablation studies confirm the effectiveness of the individual components designed in GradeOpt.

Related papers

Enhancing LLM-Based Short Answer Grading with Retrieval-Augmented Generation [32.12573291200363]
Large language models (LLMs) possess human-like ability in linguistic tasks. Retrieval-augmented generation (RAG) emerges as a promising solution. We propose an adaptive RAG framework for automated grading.
arXiv Detail & Related papers (2025-04-07T17:17:41Z)
LLM-based Automated Grading with Human-in-the-Loop [32.14015215819979]
Large language models (LLMs) are increasingly being used for automatic short answer grading (ASAG) In this work, we explore the potential of LLMs in ASAG tasks by leveraging their interactive capabilities through a human-in-the-loop (HITL) approach. Our proposed framework, GradeHITL, utilizes the generative properties of LLMs to pose questions to human experts, incorporating their insights to refine grading rubrics dynamically.
arXiv Detail & Related papers (2025-04-07T16:23:07Z)
LLMs Can Generate a Better Answer by Aggregating Their Own Responses [83.69632759174405]
Large Language Models (LLMs) have shown remarkable capabilities across tasks, yet they often require additional prompting techniques when facing complex problems. We argue this limitation stems from the fact that common LLM post-training procedures lack explicit supervision for discriminative judgment tasks. We propose Generative Self-Aggregation (GSA), a novel prompting method that improves answer quality without requiring the model's discriminative capabilities.
arXiv Detail & Related papers (2025-03-06T05:25:43Z)
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [89.97082652805904]
We propose QLASS (Q-guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q-values. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value. We empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis.
arXiv Detail & Related papers (2025-02-04T18:58:31Z)
SteLLA: A Structured Grading System Using LLMs with RAG [2.630522349105014]
We present SteLLA (Structured Grading System Using LLMs with RAG) in which a) Retrieval Augmented Generation (RAG) is used to empower LLMs on the ASAG task. A real-world dataset that contains students' answers in an exam was collected from a college-level Biology course. Experiments show that our proposed system can achieve substantial agreement with the human grader while providing break-down grades and feedback on all the knowledge points examined in the problem.
arXiv Detail & Related papers (2025-01-15T19:24:48Z)
ASAG2024: A Combined Benchmark for Short Answer Grading [0.10826342457160269]
Short Answer Grading (SAG) systems aim to automatically score students' answers. There exists no comprehensive short-answer grading benchmark across different subjects, grading scales, and distributions. We introduce the combined ASAG2024 benchmark to facilitate the comparison of automated grading systems.
arXiv Detail & Related papers (2024-09-27T09:56:02Z)
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies. We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z)
DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs [70.54226917774933]
We propose the DecompositionAlignment-Reasoning Agent (DARA) framework. DARA effectively parses questions into formal queries through a dual mechanism. We show that DARA attains performance comparable to state-of-the-art enumerating-and-ranking-based methods for KGQA.
arXiv Detail & Related papers (2024-06-11T09:09:37Z)
Grade Like a Human: Rethinking Automated Assessment with Large Language Models [11.442433408767583]
Large language models (LLMs) have been used for automated grading, but they have not yet achieved the same level of performance as humans. We propose an LLM-based grading system that addresses the entire grading procedure, including the following key components.
arXiv Detail & Related papers (2024-05-30T05:08:15Z)
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation [9.390902237835457]
We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG) Evaluation is performed by scoring the RAG on an automatically-generated synthetic exam composed of multiple choice questions.
arXiv Detail & Related papers (2024-05-22T13:14:11Z)
Automated Long Answer Grading with RiceChem Dataset [19.34390869143846]
We introduce a new area of study in the field of educational Natural Language Processing: Automated Long Answer Grading (ALAG) ALAG presents unique challenges due to the complexity and multifaceted nature of fact-based long answers. We propose a novel approach to ALAG by formulating it as a rubric entailment problem, employing natural language inference models.
arXiv Detail & Related papers (2024-04-22T16:28:09Z)
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [76.95062553043607]
evaluating large language models (LLMs) is essential for understanding their capabilities and facilitating their integration into practical applications. We introduce AgentBoard, a pioneering comprehensive benchmark and accompanied open-source evaluation framework tailored to analytical evaluation of LLM agents.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection. It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
arXiv Detail & Related papers (2023-10-17T18:18:32Z)
Towards LLM-based Autograding for Short Textual Answers [4.853810201626855]
This manuscript is an evaluation of a large language model for the purpose of autograding. Our findings suggest that while "out-of-the-box" LLMs provide a valuable tool, their readiness for independent automated grading remains a work in progress.
arXiv Detail & Related papers (2023-09-09T22:25:56Z)
Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning [56.26889258704261]
We propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA) SAMA prompts pretrained language models with chain-of-thought that can suggest potential goals, provide suitable goal decomposition and subgoal allocation as well as self-reflection-based replanning. SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods.
arXiv Detail & Related papers (2023-05-18T10:37:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.