Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam
- URL: http://arxiv.org/abs/2510.05162v1
- Date: Sat, 04 Oct 2025 15:07:06 GMT
- Title: Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam
- Authors: Gerd Kortemeyer, Alexander Caspar, Daria Horica,
- Abstract summary: In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs). We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item. Unfiltered AI-TA agreement was moderate, adequate for low-stakes feedback but not for high-stakes use.
- Score: 41.99844472131922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate whether contemporary multimodal LLMs can assist with grading open-ended calculus at scale without eroding validity. In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs), with fractional credit permitted; TA rubric decisions served as ground truth. We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item. Unfiltered AI-TA agreement was moderate, adequate for low-stakes feedback but not for high-stakes use. Confidence filtering made the workload-quality trade-off explicit: under stricter settings, AI delivered human-level accuracy, but also left roughly 70% of the items to be graded by humans. Psychometric patterns were constrained by low stakes on the open-ended portion, a small set of rubric checkpoints, and occasional misalignment between designated answer regions and where work appeared. Practical adjustments such as slightly higher weight and protected time, a few rubric-visible substeps, and stronger spatial anchoring should raise ceiling performance. Overall, calibrated confidence and conservative routing enable AI to reliably handle a sizable subset of routine cases while reserving expert judgment for ambiguous or pedagogically rich responses.
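The filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the partial-credit band, the risk limit, and the use of an absolute deviation are all assumptions made for illustration. Only the general idea (a 2PL expected score compared against the AI score, combined with a partial-credit check) comes from the abstract.

```python
import math


def p_correct_2pl(theta, a, b):
    """Standard 2PL item response function.

    theta: student ability, a: item discrimination, b: item difficulty.
    Returns the model-expected probability of a correct response.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def route(ai_score, theta, a, b, max_points=1.0,
          partial_band=(0.25, 0.75), risk_limit=0.5):
    """Decide whether an AI-assigned score is kept or sent to a human.

    Assumptions (hypothetical, not taken from the paper):
    - scores falling inside partial_band count as "partial credit"
      and are always routed to a human grader;
    - risk is the absolute gap between the AI's fractional score and
      the 2PL-expected score for this student-item pair.
    """
    frac = ai_score / max_points
    expected = p_correct_2pl(theta, a, b)
    risk = abs(frac - expected)
    is_partial = partial_band[0] <= frac <= partial_band[1]
    if is_partial or risk > risk_limit:
        return "human"
    return "ai"


if __name__ == "__main__":
    # A strong student on an easy item: full credit is unsurprising.
    print(route(1.0, theta=2.0, a=1.5, b=0.0))
    # The same student scored zero by the AI: large deviation from
    # the model expectation, so the item goes to a human.
    print(route(0.0, theta=2.0, a=1.5, b=0.0))
```

Tightening `partial_band` or lowering `risk_limit` sends more items to human graders, which is the workload-quality trade-off the abstract refers to.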
Related papers
- Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark [9.922581736690159]
We present a large-scale empirical study of AI grading on real, handwritten calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review.
arXiv Detail & Related papers (2026-03-01T03:32:51Z)
- Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness [4.129847064263056]
We systematically evaluate the performance of Large Language Models for rubric-based short-answer grading. We find that alignment is strong for binary tasks but degrades with increased rubric granularity. Experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions.
arXiv Detail & Related papers (2025-12-21T05:22:04Z)
- Evaluating Generative AI for CS1 Code Grading: Direct vs Reverse Methods [0.0]
This paper compares two AI-based grading techniques: Direct, where the AI model applies a rubric directly to student code, and Reverse (a newly proposed approach), where the AI first fixes errors, then deduces a grade based on the nature and number of fixes. We discuss the strengths and limitations of each approach, practical considerations for prompt design, and future directions for hybrid human-AI grading systems.
arXiv Detail & Related papers (2025-11-17T01:38:06Z)
- Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions [1.1883838320818292]
Large language models (LLMs) in hiring promise to streamline candidate screening, but they also raise serious concerns regarding accuracy and algorithmic bias. We benchmark several state-of-the-art foundational LLMs and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. Our experiments show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups.
arXiv Detail & Related papers (2025-07-02T19:02:18Z)
- T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
- ChatGPT for automated grading of short answer questions in mechanical ventilation [0.0]
Large language models (LLMs) simulate conversational language and interpret unstructured free-text responses. We evaluated ChatGPT 4o for grading short answer questions (SAQs) in a postgraduate medical setting using data from 215 students.
arXiv Detail & Related papers (2025-05-05T19:04:25Z)
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
- Probably Approximately Precision and Recall Learning [60.00180898830079]
A key challenge in machine learning is the prevalence of one-sided feedback. We introduce a Probably Approximately Correct (PAC) framework in which hypotheses are set functions that map each input to a set of labels. We develop new algorithms that learn from positive data alone, achieving optimal sample complexity in the realizable case.
arXiv Detail & Related papers (2024-11-20T04:21:07Z)
- Quantifying and Optimizing Global Faithfulness in Persona-driven Role-playing [37.92922713921964]
Persona-driven role-playing (PRP) aims to build AI characters that can respond to user queries by faithfully sticking with all persona statements.
This paper presents a pioneering exploration to quantify PRP faithfulness as a fine-grained and explainable criterion, which also serves as a reliable reference for optimization.
arXiv Detail & Related papers (2024-05-13T13:21:35Z)
- Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion.
Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities.
We introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z)
- Distractor generation for multiple-choice questions with predictive prompting and large language models [21.233186754403093]
Large Language Models (LLMs) such as ChatGPT have demonstrated remarkable performance across various tasks.
We propose a strategy for guiding LLMs in generating relevant distractors by prompting them with question items automatically retrieved from a question bank.
We found that on average 53% of the generated distractors presented to the teachers were rated as high-quality, i.e., suitable for immediate use as is.
arXiv Detail & Related papers (2023-07-30T23:15:28Z)
- Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more in risk minimization.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.