Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark
- URL: http://arxiv.org/abs/2603.00895v1
- Date: Sun, 01 Mar 2026 03:32:51 GMT
- Title: Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark
- Authors: Zhiqi Yu, Xingping Liu, Haobin Mao, Mingshuo Liu, Long Chen, Jack Xin, Yifeng Yu
- Abstract summary: We present a large-scale empirical study of AI grading on real, handwritten calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review.
- Score: 9.922581736690159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes. Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable, real-course deployment. Building on the dataset and evaluation framework developed here, we outline a standardized benchmark for AI grading of handwritten mathematics to support reproducible comparison and future research.
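The pipeline the abstract describes (OCR transcript + rubric + structured prompt, yielding a score and feedback) can be made concrete with a minimal sketch. Everything below is an assumption for illustration: the `call_llm` stub, the two-item rubric, and the JSON output contract are hypothetical, not the paper's actual implementation.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical chat-completion client; plug in any LLM provider."""
    raise NotImplementedError

# Illustrative two-item rubric; the paper's real rubrics are not shown here.
RUBRIC = [
    {"id": "R1", "points": 2, "criterion": "Correctly applies the chain rule"},
    {"id": "R2", "points": 1, "criterion": "Simplifies the final derivative"},
]

def grade_submission(problem: str, ocr_text: str) -> dict:
    """Score one OCR-transcribed free-response answer against the rubric."""
    rubric_block = "\n".join(
        f"- {r['id']} ({r['points']} pt): {r['criterion']}" for r in RUBRIC
    )
    prompt = (
        "You are grading handwritten calculus work transcribed by OCR.\n"
        "The transcript may contain OCR errors; grade the underlying math.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Student work (OCR transcript):\n{ocr_text}\n\n"
        f"Rubric:\n{rubric_block}\n\n"
        'Reply with JSON only: {"scores": {"R1": int, "R2": int}, "feedback": "..."}'
    )
    return json.loads(call_llm(prompt))
```

Telling the model explicitly that its input is noisy OCR output, and demanding machine-checkable JSON keyed to rubric items, reflects the kind of rubric- and prompt-design principles the paper says it distills; a deployed grader would additionally validate returned scores against each item's point cap and retry on malformed output.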
Related papers
- Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays [15.895792302323883]
In educational contexts, teachers and learners require interpretable, trait-level feedback. We study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms. We show that explicitly modeling score ordinality substantially improves agreement with human raters (a generic ordinal-loss sketch follows this list).
arXiv Detail & Related papers (2026-02-04T14:30:52Z) - EduResearchBench: A Hierarchical Atomic Task Decomposition Benchmark for Full-Lifecycle Educational Research [21.988207602041182]
We introduce EduResearchBench, the first comprehensive evaluation platform dedicated to academic writing. EduResearchBench is built upon our Hierarchical Atomic Task Decomposition (HATD) framework. We propose a curriculum learning strategy that progressively builds competence from foundational skills to complex methodological reasoning and argumentation.
arXiv Detail & Related papers (2026-01-22T09:52:30Z) - RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension [65.81339691942757]
RPC-Bench is a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts.
arXiv Detail & Related papers (2026-01-14T11:37:00Z) - Exposía: Academic Writing Assessment of Exposés and Peer Feedback [56.428320613219306]
We present Exposía, the first public dataset that connects writing and feedback assessment in higher education. We use Exposía to benchmark state-of-the-art open-source large language models (LLMs) for two tasks: automated scoring of (1) the proposals and (2) the student reviews.
arXiv Detail & Related papers (2026-01-10T11:33:26Z) - CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection [60.52240468810558]
We introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews. We also develop CoCoDet, an AI review detector based on a multi-task learning framework, to achieve more accurate and robust detection of AI involvement in review content.
arXiv Detail & Related papers (2025-08-28T06:03:11Z) - Teaching at Scale: Leveraging AI to Evaluate and Elevate Engineering Education [3.557803321422781]
This article presents a scalable, AI-supported framework for qualitative student feedback using large language models. The system employs hierarchical summarization, anonymization, and exception handling to extract actionable themes from open-ended comments. We report on its successful deployment across a large college of engineering.
arXiv Detail & Related papers (2025-08-01T20:27:40Z) - RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive and sound assessments. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z) - Ratas framework: A comprehensive genai-based approach to rubric-based marking of real-world textual exams [3.4132239125074206]
RATAS (Rubric Automated Tree-based Answer Scoring) is a novel framework that leverages state-of-the-art generative AI models for rubric-based grading of textual responses. RATAS is designed to support a wide range of grading rubrics, enable subject-agnostic evaluation, and generate structured, explainable rationales for assigned scores.
arXiv Detail & Related papers (2025-05-27T22:17:27Z) - XtraGPT: Context-Aware and Controllable Academic Paper Revision [43.263488839387584]
We propose a human-AI collaboration framework for academic paper revision centered on criteria-guided intent alignment and context-aware modeling. We instantiate the framework in XtraGPT, the first suite of open-source LLMs for context-aware, instruction-guided writing assistance.
arXiv Detail & Related papers (2025-05-16T15:02:19Z) - Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are increasingly used for automated evaluation across diverse scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models. We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z) - Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)