Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
- URL: http://arxiv.org/abs/2602.22585v1
- Date: Thu, 26 Feb 2026 03:35:36 GMT
- Title: Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
- Authors: Jodi M. Casabianca, Maggie Beiting-Parrish,
- Abstract summary: This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. We show how adjusting for rater severity produces corrected estimates of summary quality. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. The paper reviews common rater effects, severity and centrality, that distort observed ratings, and demonstrates how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers more principled and transparent use of human data, enabling developers to make decisions based on adjusted scores rather than raw, error-prone ratings. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.
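The severity adjustment described in the abstract can be illustrated with a minimal sketch of a dichotomous many-facet Rasch model fit by joint maximum likelihood on simulated ratings. This is not the paper's code: operational MFRM analyses use dedicated software and handle polytomous rating scales, and every variable name and constant below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_summaries, n_raters = 50, 8

# Simulated ground truth: latent summary quality and rater severity (in logits).
true_quality = rng.normal(0.0, 1.0, n_summaries)
true_severity = rng.normal(0.0, 0.7, n_raters)

# Dichotomous model: P(rating = acceptable) = sigmoid(quality_s - severity_r).
logits = true_quality[:, None] - true_severity[None, :]
ratings = (rng.random((n_summaries, n_raters)) < 1 / (1 + np.exp(-logits))).astype(float)

# Joint maximum-likelihood fit by gradient ascent on the Bernoulli log-likelihood.
quality = np.zeros(n_summaries)
severity = np.zeros(n_raters)
lr = 0.05
for _ in range(2000):
    p = 1 / (1 + np.exp(-(quality[:, None] - severity[None, :])))
    resid = ratings - p                   # derivative of log-likelihood w.r.t. logits
    quality += lr * resid.sum(axis=1)
    severity -= lr * resid.sum(axis=0)
    severity -= severity.mean()           # identification: mean rater severity = 0

# 'quality' is now a severity-adjusted estimate of each summary's quality.
print("recovered quality  vs truth:", round(np.corrcoef(quality, true_quality)[0, 1], 3))
print("recovered severity vs truth:", round(np.corrcoef(severity, true_severity)[0, 1], 3))
```

The centering step fixes the scale indeterminacy (only differences quality − severity are identified), mirroring the anchoring conventions used in operational Rasch analyses.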
Related papers
- Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models [16.178449605148995]
We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments. In a large-scale evaluation, MarODE outperforms existing baselines by over 250%.
arXiv Detail & Related papers (2026-03-02T08:09:33Z)
- PhyCritic: Multimodal Critic Models for Physical AI [101.37916322714041]
We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline. We show that PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
arXiv Detail & Related papers (2026-02-11T18:35:39Z)
- ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models [102.4511331368587]
ARISE (Adaptive Resolution-aware Scaling Evaluation) is a novel metric designed to assess the test-time scaling effectiveness of large reasoning models. We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains.
arXiv Detail & Related papers (2025-10-07T15:10:51Z)
- Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas [31.16720541398267]
We propose a doubly-robust estimation framework designed to address evaluation sampling bias. Key to our approach is the use of "persona" ratings produced by prompting an evaluator to behave as a human rater. We show that our approach yields valid system-quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias, is of sufficient quality.
arXiv Detail & Related papers (2025-09-26T21:42:51Z)
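As a rough sketch of the doubly-robust idea in this entry (not the paper's estimator), an AIPW-style mean combines an outcome model built from persona ratings with inverse-probability reweighting of residuals on the human-labeled items; the estimate remains consistent if either component is well specified. All names below are illustrative assumptions.

```python
import numpy as np

def doubly_robust_mean(persona_pred, human_rating, labeled, label_prob):
    """AIPW-style estimate of mean system quality.

    persona_pred: outcome-model predictions (persona ratings) for every item
    human_rating: human ratings, trusted only where labeled == 1
    labeled:      1 if the item received a human rating, else 0
    label_prob:   estimated probability that each item was human-labeled
    """
    observed = np.where(labeled == 1, human_rating, 0.0)   # mask unlabeled entries
    correction = labeled * (observed - persona_pred) / label_prob
    return float(np.mean(persona_pred + correction))
```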
- The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in reasoning processes. We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs). We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
arXiv Detail & Related papers (2025-01-13T13:10:16Z)
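The consensus-filtering mechanism suggests a generic sketch: retain a reasoning step's correctness label only when the Monte Carlo estimate and the LLM judge agree. The threshold and interfaces below are assumptions, not the paper's implementation.

```python
def consensus_filter(steps, mc_scores, llm_verdicts, mc_threshold=0.5):
    """Keep a step's label only when MC estimation and the LLM judge agree."""
    kept = []
    for step, mc, llm_says_correct in zip(steps, mc_scores, llm_verdicts):
        mc_says_correct = mc >= mc_threshold
        if mc_says_correct == llm_says_correct:   # consensus between the two sources
            kept.append((step, mc_says_correct))
    return kept
```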
- Learning to Generate and Evaluate Fact-checking Explanations with Transformers [10.970249299147866]
This research contributes to the field of Explainable Artificial Intelligence (XAI).
We develop transformer-based fact-checking models that contextualise and justify their decisions by generating human-accessible explanations.
We emphasise the need for aligning Artificial Intelligence (AI)-generated explanations with human judgements.
arXiv Detail & Related papers (2024-10-21T06:22:51Z)
- Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge [51.93909886542317]
We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human labels and those from automatic evaluation. We propose stratifying data by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z)
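A minimal sketch of the proposed stratification, assuming annotator score spread as the uncertainty proxy (the paper's exact stratification may differ); all names are illustrative.

```python
import numpy as np

def stratified_agreement(human_scores, judge_scores, n_strata=3):
    """Judge/human correlation within strata of human-label uncertainty.

    human_scores: (n_items, n_annotators) raw annotator scores
    judge_scores: (n_items,) automatic-evaluator scores
    """
    mean_human = human_scores.mean(axis=1)
    uncertainty = human_scores.std(axis=1)            # annotator disagreement as proxy
    edges = np.quantile(uncertainty, np.linspace(0, 1, n_strata + 1))
    strata = np.digitize(uncertainty, edges[1:-1])    # bin index 0 .. n_strata - 1
    return {s: float(np.corrcoef(mean_human[strata == s], judge_scores[strata == s])[0, 1])
            for s in range(n_strata)}
```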
- Poor-Supervised Evaluation for SuperLLM via Mutual Consistency [20.138831477848615]
We propose the PoEM framework to conduct evaluation without accurate labels.
We first prove that, under certain conditions, the capability of a model can be equivalently assessed by its consistency with a reference model.
Since these conditions rarely hold fully in practice, we introduce an algorithm that treats humans (when available) and the models under evaluation as reference models.
arXiv Detail & Related papers (2024-08-25T06:49:03Z)
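The consistency idea can be sketched as simple agreement rates against a pool of reference predictors (humans when available, peer models otherwise); this illustrates the claim, not the PoEM algorithm itself, and all names are assumptions.

```python
import numpy as np

def consistency(preds_a, preds_b):
    """Fraction of items on which two sets of answers agree."""
    return float(np.mean(np.asarray(preds_a) == np.asarray(preds_b)))

def poor_supervised_score(model_preds, reference_pred_sets):
    """Mean consistency with references (humans when available, else peer models)."""
    return float(np.mean([consistency(model_preds, ref) for ref in reference_pred_sets]))
```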
- Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Model (LLM) evaluator preferences with human evaluations.
We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer.
Our recalibration significantly improves alignment between the LLM evaluator and human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
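A frequentist-only sketch of the bias check and recalibration (the paper also employs Bayesian statistics, and GPTScorer's details are not reproduced here); the linear length correction and all names are assumptions.

```python
import numpy as np
from scipy import stats

def token_count_bias(scores, token_counts):
    """Two-sample t-test: do longer outputs receive systematically higher scores?"""
    long_mask = token_counts > np.median(token_counts)
    return stats.ttest_ind(scores[long_mask], scores[~long_mask])

def recalibrate(scores, token_counts):
    """Subtract the linear component of score explained by output length."""
    slope, _ = np.polyfit(token_counts, scores, 1)
    return scores - slope * (token_counts - token_counts.mean())
```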
- Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
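The self-training loop with a pseudo-label scorer admits a generic sketch: pseudo-label unlabeled reviews, then keep only pairs the scorer rates as a good match. The interfaces and threshold below are illustrative placeholders, not the paper's models.

```python
def self_train_with_scorer(model, scorer, labeled, unlabeled, threshold=0.8, rounds=3):
    """Scorer-filtered self-training loop over review/pseudo-label pairs."""
    for _ in range(rounds):
        model.fit(labeled)
        pseudo = [(review, model.predict(review)) for review in unlabeled]
        # Keep only pairs the scorer judges to be a good review/label match.
        labeled = labeled + [(r, y) for r, y in pseudo if scorer.score(r, y) >= threshold]
    return model
```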
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate, for example, that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation [15.8765167340819]
Human annotator simulation (HAS) serves as a cost-effective substitute for human evaluation such as data annotation and system assessment.
Human perception and behaviour during human evaluation exhibit inherent variability due to diverse cognitive processes and subjective interpretations.
This paper introduces a novel meta-learning framework that treats HAS as a zero-shot density estimation problem.
arXiv Detail & Related papers (2023-09-30T20:54:59Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory of human assessment originating in the 20th century, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)