Related papers: Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam

URL: http://arxiv.org/abs/2510.05162v1
Date: Sat, 04 Oct 2025 15:07:06 GMT
Title: Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam
Authors: Gerd Kortemeyer, Alexander Caspar, Daria Horica
Abstract summary: In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs). We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item. Unfiltered AI-TA agreement was moderate, adequate for low-stakes feedback but not for high-stakes use.
Score: 41.99844472131922
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We investigate whether contemporary multimodal LLMs can assist with grading open-ended calculus at scale without eroding validity. In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs), with fractional credit permitted; TA rubric decisions served as ground truth. We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item. Unfiltered AI-TA agreement was moderate, adequate for low-stakes feedback but not for high-stakes use. Confidence filtering made the workload-quality trade-off explicit: under stricter settings, AI delivered human-level accuracy, but also left roughly 70% of the items to be graded by humans. Psychometric patterns were constrained by low stakes on the open-ended portion, a small set of rubric checkpoints, and occasional misalignment between designated answer regions and where work appeared. Practical adjustments, such as slightly higher weight and protected time, a few rubric-visible substeps, and stronger spatial anchoring, should raise ceiling performance. Overall, calibrated confidence and conservative routing enable AI to reliably handle a sizable subset of routine cases while reserving expert judgment for ambiguous or pedagogically rich responses.
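The filter described in the abstract can be illustrated with a minimal sketch. The abstract gives neither the thresholds nor the exact way the two criteria are combined, so everything below is an assumption made for illustration: AI scores are taken as fractions in [0, 1], the function names, the `partial_band`, and the `risk_cutoff` are hypothetical, and the risk is taken as the absolute deviation between the AI score and the 2PL model-expected score.

```python
import math


def expected_score_2pl(theta: float, a: float, b: float) -> float:
    """Standard 2PL item response function.

    theta: student ability, a: item discrimination, b: item difficulty.
    Returns the model-expected score (probability of success) in (0, 1).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def route_to_human(
    ai_score: float,
    theta: float,
    a: float,
    b: float,
    partial_band: tuple = (0.25, 0.75),  # hypothetical band, not from the paper
    risk_cutoff: float = 0.5,            # hypothetical cutoff, not from the paper
) -> bool:
    """Return True if a student-item should be sent to a human grader.

    Assumed rule: route when the AI awarded partial credit inside the
    band, or when the AI score deviates strongly from the 2PL expectation.
    """
    low, high = partial_band
    is_partial = low < ai_score < high
    risk = abs(ai_score - expected_score_2pl(theta, a, b))
    return is_partial or risk > risk_cutoff


if __name__ == "__main__":
    # Strong student, full credit from the AI: consistent with the model.
    print(route_to_human(1.0, theta=2.0, a=1.5, b=0.0))
    # Same student, zero credit from the AI: large deviation.
    print(route_to_human(0.0, theta=2.0, a=1.5, b=0.0))
```

Tightening either hypothetical parameter sends more items to humans, which is the workload-quality trade-off the abstract refers to.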

Related papers

Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark [9.922581736690159]
We present a large-scale empirical study of AI grading on real, handwritten calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review.
arXiv Detail & Related papers (2026-03-01T03:32:51Z)
Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness [4.129847064263056]
We systematically evaluate the performance of Large Language Models for rubric-based short-answer grading. We find that alignment is strong for binary tasks but degrades with increased rubric granularity. Experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions.
arXiv Detail & Related papers (2025-12-21T05:22:04Z)
Evaluating Generative AI for CS1 Code Grading: Direct vs Reverse Methods [0.0]
This paper compares two AI-based grading techniques: Direct, where the AI model applies a rubric directly to student code, and Reverse (a newly proposed approach), where the AI first fixes errors, then deduces a grade based on the nature and number of fixes. We discuss the strengths and limitations of each approach, practical considerations for prompt design, and future directions for hybrid human-AI grading systems.
arXiv Detail & Related papers (2025-11-17T01:38:06Z)
Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions [1.1883838320818292]
Large language models (LLMs) in hiring promise to streamline candidate screening, but they also raise serious concerns regarding accuracy and algorithmic bias. We benchmark several state-of-the-art foundational LLMs and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. Our experiments show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups.
arXiv Detail & Related papers (2025-07-02T19:02:18Z)
T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
ChatGPT for automated grading of short answer questions in mechanical ventilation [0.0]
Large language models (LLMs) simulate conversational language and interpret unstructured free-text responses. We evaluated ChatGPT 4o for grading short answer questions (SAQs) in a postgraduate medical setting using data from 215 students.
arXiv Detail & Related papers (2025-05-05T19:04:25Z)
PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
Probably Approximately Precision and Recall Learning [60.00180898830079]
A key challenge in machine learning is the prevalence of one-sided feedback. We introduce a Probably Approximately Correct (PAC) framework in which hypotheses are set functions that map each input to a set of labels. We develop new algorithms that learn from positive data alone, achieving optimal sample complexity in the realizable case.
arXiv Detail & Related papers (2024-11-20T04:21:07Z)
Quantifying and Optimizing Global Faithfulness in Persona-driven Role-playing [37.92922713921964]
Persona-driven role-playing (PRP) aims to build AI characters that can respond to user queries by faithfully sticking with all persona statements. This paper presents a pioneering exploration to quantify PRP faithfulness as a fine-grained and explainable criterion, which also serves as a reliable reference for optimization.
arXiv Detail & Related papers (2024-05-13T13:21:35Z)
Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion. Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities. We introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z)
Distractor generation for multiple-choice questions with predictive prompting and large language models [21.233186754403093]
Large Language Models (LLMs) such as ChatGPT have demonstrated remarkable performance across various tasks. We propose a strategy for guiding LLMs in generating relevant distractors by prompting them with question items automatically retrieved from a question bank. We found that on average 53% of the generated distractors presented to the teachers were rated as high-quality, i.e., suitable for immediate use as is.
arXiv Detail & Related papers (2023-07-30T23:15:28Z)
Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels. We show that the quality of gradient estimation matters more in risk minimization. We propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.