Related papers: Assisting the Grading of a Handwritten General Chemistry Exam with Artificial Intelligence

Assisting the Grading of a Handwritten General Chemistry Exam with Artificial Intelligence

URL: http://arxiv.org/abs/2509.10591v2
Date: Mon, 10 Nov 2025 12:37:31 GMT
Title: Assisting the Grading of a Handwritten General Chemistry Exam with Artificial Intelligence
Authors: Jan Cvengros, Gerd Kortemeyer,
Abstract summary: We explore the effectiveness and reliability of an artificial intelligence (AI)-based grading system for a handwritten general chemistry exam.<n>We compare AI-assigned scores to human grading across various types of questions.<n>The results indicate promising applications for AI in routine assessment tasks.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We explore the effectiveness and reliability of an artificial intelligence (AI)-based grading system for a handwritten general chemistry exam, comparing AI-assigned scores to human grading across various types of questions. Exam pages and grading rubrics were uploaded as images to account for chemical reaction equations, short and long open-ended answers, numerical and symbolic answer derivations, drawing, and sketching in pencil-and-paper format. Using linear regression analyses and psychometric evaluations, the investigation reveals high agreement between AI and human graders for textual and chemical reaction questions, while highlighting lower reliability for numerical and graphical tasks. The findings emphasize the necessity for human oversight to ensure grading accuracy, based on selective filtering. The results indicate promising applications for AI in routine assessment tasks, though careful consideration must be given to student perceptions of fairness and trust in integrating AI-based grading into educational practice.

Related papers

Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark [9.922581736690159]
We present a large-scale empirical study of AI grading on real, handwritten calculus work from UC Irvine.<n>Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions.<n>In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review.
arXiv Detail & Related papers (2026-03-01T03:32:51Z)
AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment [69.06977852423564]
Image quality assessment (IQA) reflects both the quantification and interpretation of perceptual quality rooted in the human visual system.<n>AgenticIQA decomposes IQA into four subtasks -- distortion detection, distortion analysis, tool selection, and tool execution.<n>To support training and evaluation, we introduce AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and summarization capabilities of VLM-based IQA agents.
arXiv Detail & Related papers (2025-09-30T09:37:01Z)
Evaluating LLM-Generated Q&A Test: a Student-Centered Study [0.06749750044497731]
We automatically generated a GPT-4o-mini-based Q&A test for a Natural Language Processing course and evaluated its psychometric and perceived-quality metrics with students and experts.<n>A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality.
arXiv Detail & Related papers (2025-05-10T10:47:23Z)
General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems.<n>We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure.<n>Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications [0.0]
generative AI is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring.<n>We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI.
arXiv Detail & Related papers (2025-01-04T16:59:29Z)
Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments [0.0]
Despite the widespread use of multiple-choice questions in assessments, the detection of AI cheating has been almost unexplored.<n>We propose a method based on the application of Item Response Theory to address this gap.<n>Our approach operates on the assumption that artificial and human intelligence exhibit different response patterns.
arXiv Detail & Related papers (2024-11-28T09:43:06Z)
Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.<n>Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
arXiv Detail & Related papers (2024-11-07T22:51:47Z)
Quality Assessment for AI Generated Images with Instruction Tuning [58.41087653543607]
We first establish a novel Image Quality Assessment (IQA) database for AIGIs, termed AIGCIQA2023+.<n>This paper presents a MINT-IQA model to evaluate and explain human preferences for AIGIs from Multi-perspectives with INstruction Tuning.
arXiv Detail & Related papers (2024-05-12T17:45:11Z)
ChemMiner: A Large Language Model Agent System for Chemical Literature Data Mining [56.15126714863963]
ChemMiner is an end-to-end framework for extracting chemical data from literature.<n>ChemMiner incorporates three specialized agents: a text analysis agent for coreference mapping, a multimodal agent for non-textual information extraction, and a synthesis analysis agent for data generation.<n> Experimental results demonstrate reaction identification rates comparable to human chemists while significantly reducing processing time, with high accuracy, recall, and F1 scores.
arXiv Detail & Related papers (2024-02-20T13:21:46Z)
Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis [55.30328162764292]
Chemist-X is a comprehensive AI agent that automates the reaction condition optimization (RCO) task in chemical synthesis.<n>The agent uses retrieval-augmented generation (RAG) technology and AI-controlled wet-lab experiment executions.<n>Results of our automatic wet-lab experiments, achieved by fully LLM-supervised end-to-end operation with no human in the lope, prove Chemist-X's ability in self-driving laboratories.
arXiv Detail & Related papers (2023-11-16T01:21:33Z)
MAILS -- Meta AI Literacy Scale: Development and Testing of an AI Literacy Questionnaire Based on Well-Founded Competency Models and Psychological Change- and Meta-Competencies [6.368014180870025]
The questionnaire should be modular (i.e., including different facets that can be used independently of each other) to be flexibly applicable in professional life. We derived 60 items to represent different facets of AI Literacy according to Ng and colleagues conceptualisation of AI literacy. Additional 12 items to represent psychological competencies such as problem solving, learning, and emotion regulation in regard to AI.
arXiv Detail & Related papers (2023-02-18T12:35:55Z)
ChemVise: Maximizing Out-of-Distribution Chemical Detection with the Novel Application of Zero-Shot Learning [60.02503434201552]
This research proposes learning approximations of complex exposures from training sets of simple ones. We demonstrate this approach to synthetic sensor responses surprisingly improves the detection of out-of-distribution obscured chemical analytes.
arXiv Detail & Related papers (2023-02-09T20:19:57Z)
Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics. We find that AES models are highly overstable. Even heavy modifications(as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.