Related papers: Confusion-Aware Rubric Optimization for LLM-based Automated Grading

URL: http://arxiv.org/abs/2603.00451v1
Date: Sat, 28 Feb 2026 04:17:12 GMT
Title: Confusion-Aware Rubric Optimization for LLM-based Automated Grading
Authors: Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik, Namsoo Shin, Jiliang Tang
Abstract summary: We introduce Confusion-Aware Rubric Optimization (CARO), a novel framework that enhances accuracy and computational efficiency. CARO decomposes monolithic error signals into distinct modes, allowing for unambiguous diagnosis and repair of specific misclassification patterns. These results suggest that replacing mixed-error aggregation with surgical, mode-specific repair yields robust improvements in automated assessment scalability and precision.
Score: 31.353360036776976
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Accurate and unambiguous guidelines are critical for large language model (LLM) based graders, yet manually crafting these prompts is often sub-optimal as LLMs can misinterpret expert guidelines or lack necessary domain specificity. Consequently, the field has moved toward automated prompt optimization to refine grading guidelines without the burden of manual trial and error. However, existing frameworks typically aggregate independent and unstructured error samples into a single update step, resulting in "rule dilution" where conflicting constraints weaken the model's grading logic. To address these limitations, we introduce Confusion-Aware Rubric Optimization (CARO), a novel framework that enhances accuracy and computational efficiency by structurally separating error signals. CARO leverages the confusion matrix to decompose monolithic error signals into distinct modes, allowing for the diagnosis and repair of specific misclassification patterns individually. By synthesizing targeted "fixing patches" for dominant error modes and employing a diversity-aware selection mechanism, the framework prevents guidance conflict and eliminates the need for resource-heavy nested refinement loops. Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods. These results suggest that replacing mixed-error aggregation with surgical, mode-specific repair yields robust improvements in automated assessment scalability and precision.
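
The decomposition step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the sample data, and the ranking rule are assumptions made for illustration, and the patch synthesis and diversity-aware selection steps are not shown. The sketch only groups misgraded samples by their confusion-matrix cell (expert score, LLM score) and ranks the cells by error count, so that each dominant error mode could be handled on its own instead of being mixed into one update.

```python
from collections import defaultdict

def error_modes(samples):
    """Group misgraded samples by their (expert, predicted) confusion cell."""
    modes = defaultdict(list)
    for response, expert, predicted in samples:
        if expert != predicted:  # keep only off-diagonal cells
            modes[(expert, predicted)].append(response)
    return modes

def dominant_modes(modes, top_k=2):
    """Rank confusion cells by error count (ties broken by cell label)."""
    ranked = sorted(modes.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(cell, len(examples)) for cell, examples in ranked[:top_k]]

# Hypothetical graded samples: (response id, expert score, LLM score).
samples = [
    ("r1", 2, 1), ("r2", 2, 1), ("r3", 2, 1),  # under-scoring full-credit answers
    ("r4", 0, 1), ("r5", 0, 1),                # over-scoring zero-credit answers
    ("r6", 1, 2),
    ("r7", 1, 1), ("r8", 2, 2),                # correctly graded, ignored
]

modes = error_modes(samples)
top = dominant_modes(modes, top_k=2)
for (expert, predicted), count in top:
    print(f"expert={expert} predicted={predicted}: {count} errors")
```

In this toy data the two largest cells are (2, 1) and (0, 1), which point to opposite rubric problems; keeping them separate is the kind of conflict the abstract refers to as "rule dilution" when errors are aggregated.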

Related papers

MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research [15.28095645151852]
MIRROR is a fine-tuning-free, end-to-end multi-agent framework for operations research. It translates natural language optimization problems into mathematical models and solver code. Experiments show that MIRROR outperforms existing methods on standard Operations Research benchmarks.
arXiv Detail & Related papers (2026-02-03T09:46:56Z)
Automated Optimization Modeling via a Localizable Error-Driven Perspective [20.591721861026414]
We propose a novel error-driven learning framework for automated optimization modeling. MIND customizes the whole model training framework from data synthesis to post-training. MIND consistently outperforms all the state-of-the-art automated optimization modeling approaches.
arXiv Detail & Related papers (2026-01-17T09:59:01Z)
Adaptive Learning Guided by Bias-Noise-Alignment Diagnostics [0.7519872646378835]
This paper proposes a diagnostic-driven learning framework that explicitly models error adaptive evolution. These diagnostics are computed online from lightweight statistics of loss or temporal-difference (TD) error trajectories.
arXiv Detail & Related papers (2025-12-30T19:57:52Z)
The Hidden Cost of Approximation in Online Mirror Descent [56.99972253009168]
Online mirror descent (OMD) is a fundamental algorithmic paradigm that underlies many algorithms in optimization, machine learning and sequential decision-making. In this work we initiate a systematic study into inexact OMD, and uncover an intricate relation between regularizer smoothness and robustness to approximation errors.
arXiv Detail & Related papers (2025-11-27T10:09:07Z)
LVLMs as inspectors: an agentic framework for category-level structural defect annotation [3.2445985501669434]
A novel agentic annotation framework, Agent-based Defect Pattern Tagger, is introduced. It integrates Large Vision-Language Models (LVLMs) with a semantic pattern matching module and an iterative self-questioning refinement mechanism. It transforms raw visual data into high-quality, semantically labeled defect datasets without any manual supervision.
arXiv Detail & Related papers (2025-10-01T07:31:42Z)
Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [79.40678802098026]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities. Current error classification methods rely on static and predefined categories. We propose Error-Aware Prompting (EAP) that incorporates common error patterns as explicit guidance.
arXiv Detail & Related papers (2025-01-26T16:17:57Z)
Self-Healing Machine Learning: A Framework for Autonomous Adaptation in Real-World Environments [50.310636905746975]
Real-world machine learning systems often encounter model performance degradation due to distributional shifts in the underlying data generating process. Existing approaches to addressing shifts, such as concept drift adaptation, are limited by their reason-agnostic nature. We propose self-healing machine learning (SHML) to overcome these limitations.
arXiv Detail & Related papers (2024-10-31T20:05:51Z)
Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE). RISE injects predefined subtle errors into pivotal tokens in reasoning or steps to construct hard pairs for error mitigation. Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment. Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics. It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
Threshold-Consistent Margin Loss for Open-World Deep Metric Learning [42.03620337000911]
Existing losses used in deep metric learning (DML) for image retrieval often lead to highly non-uniform intra-class and inter-class representation structures. Inconsistency often complicates the threshold selection process when deploying commercial image retrieval systems. We propose a novel variance-based metric called Operating-Point-Inconsistency-Score (OPIS) that quantifies the variance in the operating characteristics across classes.
arXiv Detail & Related papers (2023-07-08T21:16:41Z)
Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection [37.99031842449251]
Video anomaly detection under weak supervision presents significant challenges. We present a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability. Our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy.
arXiv Detail & Related papers (2023-06-26T06:45:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.