Related papers: AI-Enabled grading with near-domain data for scaling feedback with human-level accuracy

AI-Enabled grading with near-domain data for scaling feedback with human-level accuracy

URL: http://arxiv.org/abs/2512.04113v1
Date: Mon, 01 Dec 2025 05:11:37 GMT
Title: AI-Enabled grading with near-domain data for scaling feedback with human-level accuracy
Authors: Shyam Agarwal, Ali Moghimi, Kevin C. Haudek,
Abstract summary: This paper proposes a novel and practical approach to grade short-answer constructed-response questions.<n>Our framework does not require pre-written grading rubrics and is designed explicitly with practical classroom settings in mind.
Score: 0.5735035463793009
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Constructed-response questions are crucial to encourage generative processing and test a learner's understanding of core concepts. However, the limited availability of instructor time, large class sizes, and other resource constraints pose significant challenges in providing timely and detailed evaluation, which is crucial for a holistic educational experience. In addition, providing timely and frequent assessments is challenging since manual grading is labor intensive, and automated grading is complex to generalize to every possible response scenario. This paper proposes a novel and practical approach to grade short-answer constructed-response questions. We discuss why this problem is challenging, define the nature of questions on which our method works, and finally propose a framework that instructors can use to evaluate their students' open-responses, utilizing near-domain data like data from similar questions administered in previous years. The proposed method outperforms the state of the art machine learning models as well as non-fine-tuned large language models like GPT 3.5, GPT 4, and GPT 4o by a considerable margin of over 10-20% in some cases, even after providing the LLMs with reference/model answers. Our framework does not require pre-written grading rubrics and is designed explicitly with practical classroom settings in mind. Our results also reveal exciting insights about learning from near-domain data, including what we term as accuracy and data advantages using human-labeled data, and we believe this is the first work to formalize the problem of automated short answer grading based on the near-domain data.

Related papers

Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation [18.99847259801634]
We propose Reinforcement Learning from Augmented Generation (RLAG) to embed domain knowledge into large language models.<n>Our approach iteratively cycles between sampling generations and optimize the model through calculated rewards.<n> Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches.
arXiv Detail & Related papers (2025-09-24T14:30:16Z)
Automatic Question & Answer Generation Using Generative Large Language Model (LLM) [0.0]
This research proposes to leverage unsupervised learning methods in NLP, primarily focusing on the English language.<n>A customized model will offer efficient solutions for educators, instructors, and individuals engaged in text-based evaluations.
arXiv Detail & Related papers (2025-08-26T23:36:13Z)
Does Machine Unlearning Truly Remove Knowledge? [80.83986295685128]
We introduce a comprehensive auditing framework for unlearning evaluation comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods.<n>We evaluate the effectiveness and robustness of different unlearning strategies.
arXiv Detail & Related papers (2025-05-29T09:19:07Z)
SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation. Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
"I understand why I got this grade": Automatic Short Answer Grading with Feedback [33.63970664152288]
We introduce Engineering Short Answer Feedback (EngSAF), a dataset designed for automatic short-answer grading with feedback.<n>We incorporate feedback into our dataset by leveraging the generative capabilities of state-of-the-art large language models (LLMs) using our Label-Aware Synthetic Feedback Generation (LASFG) strategy.<n>The best-performing model (Mistral-7B) achieves an overall accuracy of 75.4% and 58.7% on unseen answers and unseen question test sets, respectively.
arXiv Detail & Related papers (2024-06-30T15:42:18Z)
Deep Learning-Based Object Pose Estimation: A Comprehensive Survey [73.74933379151419]
We discuss the recent advances in deep learning-based object pose estimation. Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks.
arXiv Detail & Related papers (2024-05-13T14:44:22Z)
SyllabusQA: A Course Logistics Question Answering Dataset [45.90423821963144]
We introduce SyllabusQA, an open-source dataset with 63 real course syllabi covering 36 majors, containing 5,078 open-ended course logistics-related question-answer pairs. We benchmark several strong baselines on this task, from large language model prompting to retrieval-augmented generation. We find that despite performing close to humans on traditional metrics of textual similarity, there remains a significant gap between automated approaches and humans in terms of fact precision.
arXiv Detail & Related papers (2024-03-03T03:01:14Z)
Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box. This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
Automatic Short Math Answer Grading via In-context Meta-learning [2.0263791972068628]
We study the problem of automatic short answer grading for students' responses to math questions. We use MathBERT, a variant of the popular language model BERT adapted to mathematical content, as our base model. Second, we use an in-context learning approach that provides scoring examples as input to the language model.
arXiv Detail & Related papers (2022-05-30T16:26:02Z)
ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback [54.142719510638614]
In this paper, we frame the problem of providing feedback as few-shot classification. A meta-learner adapts to give feedback to student code on a new programming question from just a few examples by instructors. Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university.
arXiv Detail & Related papers (2021-07-23T22:41:28Z)
The World is Not Binary: Learning to Rank with Grayscale Data for Dialogue Response Selection [55.390442067381755]
We show that grayscale data can be automatically constructed without human effort. Our method employs off-the-shelf response retrieval models and response generation models as automatic grayscale data generators. Experiments on three benchmark datasets and four state-of-the-art matching models show that the proposed approach brings significant and consistent performance improvements.
arXiv Detail & Related papers (2020-04-06T06:34:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.