An LLM -Powered Assessment Retrieval-Augmented Generation (RAG) For Higher Education
- URL: http://arxiv.org/abs/2601.06141v1
- Date: Mon, 05 Jan 2026 15:13:00 GMT
- Title: An LLM -Powered Assessment Retrieval-Augmented Generation (RAG) For Higher Education
- Authors: Reza Vatankhah Barenji, Nazila Salimi, Sina Khoshgoftar,
- Abstract summary: This study presents an agentic assessment system built on a Retrieval-Augmented Generation architecture.<n>The system integrates a large language model with a structured retrieval mechanism that accesses rubric criteria, exemplar essays, and instructor feedback.<n>Results demonstrate that the RAG system can produce reliable, rubric-aligned feedback at scale, achieving 94--99% agreement with human evaluators.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Providing timely, consistent, and high-quality feedback in large-scale higher education courses remains a persistent challenge, often constrained by instructor workload and resource limitations. This study presents an LLM-powered, agentic assessment system built on a Retrieval-Augmented Generation (RAG) architecture to address these challenges. The system integrates a large language model with a structured retrieval mechanism that accesses rubric criteria, exemplar essays, and instructor feedback to generate contextually grounded grades and formative comments. A mixed-methods evaluation was conducted using 701 student essays, combining quantitative analyses of inter-rater reliability, scoring alignment, and consistency with instructor assessments, alongside qualitative evaluation of feedback quality, pedagogical relevance, and student support. Results demonstrate that the RAG system can produce reliable, rubric-aligned feedback at scale, achieving 94--99% agreement with human evaluators, while also enhancing students' opportunities for self-regulated learning and engagement with assessment criteria. The discussion highlights both pedagogical limitations, including potential constraints on originality and feedback dialogue, and the transformative potential of RAG systems to augment instructors' capabilities, streamline assessment workflows, and support scalable, adaptive learning environments. This research contributes empirical evidence for the application of agentic AI in higher education, offering a scalable and pedagogically informed model for enhancing feedback accessibility, consistency, and quality.
Related papers
- Evaluation of Large Language Models' educational feedback in Higher Education: potential, limitations and implications for educational practice [0.0]
This study examines how AI-generated feedback supports student learning using a well-established analytical framework.<n>The evaluation process involved providing seven Large Language Models with a structured rubric, which defined specific criteria and performance levels.<n>Overall, these findings indicate that LLMs can generate well-structured feedback and hold great potential as a sustainable and meaningful feedback tool.
arXiv Detail & Related papers (2026-01-24T14:30:25Z) - Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review [53.99984738447279]
Recent work frames this task as automatic text generation, underusing author expertise and intent.<n>We introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement.<n>To support this formulation, we construct Re$3$Align, the first large-scale dataset of aligned review-response--revision triplets.
arXiv Detail & Related papers (2026-01-19T14:07:10Z) - AGACCI : Affiliated Grading Agents for Criteria-Centric Interface in Educational Coding Contexts [0.6050976240234864]
We introduce AGACCI, a multi-agent system that distributes specialized evaluation roles across collaborative agents.<n>AGACCI outperforms a single GPT-based baseline in terms of rubric and feedback accuracy, relevance, consistency, and coherence.
arXiv Detail & Related papers (2025-07-07T15:50:46Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority.<n>We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task.<n>Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments.<n>Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z) - Enhancing LLM-Based Short Answer Grading with Retrieval-Augmented Generation [32.12573291200363]
Large language models (LLMs) possess human-like ability in linguistic tasks.<n>Retrieval-augmented generation (RAG) emerges as a promising solution.<n>We propose an adaptive RAG framework for automated grading.
arXiv Detail & Related papers (2025-04-07T17:17:41Z) - A Zero-Shot LLM Framework for Automatic Assignment Grading in Higher Education [0.6141800972050401]
We propose a Zero-Shot Large Language Model (LLM)-Based Automated Assignment Grading (AAG) system.<n>This framework leverages prompt engineering to evaluate both computational and explanatory student responses without requiring additional training or fine-tuning.<n>The AAG system delivers tailored feedback that highlights individual strengths and areas for improvement, thereby enhancing student learning outcomes.
arXiv Detail & Related papers (2025-01-24T08:01:41Z) - HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF)<n>In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.<n>We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment [15.969280805269976]
AERA Chat is an interactive visualization platform designed for automated explainable student answer assessment.<n>AERA Chat leverages multiple language models (LLMs) to concurrently score student answers and generate explanatory rationales.<n>We demonstrate the effectiveness of our platform through evaluations of multiple rationale-generation methods on several datasets.
arXiv Detail & Related papers (2024-10-12T11:57:53Z) - ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z) - Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [40.06500618820166]
This paper presents an approach to deriving a learner model directly from an assessment rubric.
We illustrate how the approach can be applied to automatize the human assessment of an activity developed for testing computational thinking skills.
arXiv Detail & Related papers (2022-09-07T10:09:12Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.