Rater Equivalence: Evaluating Classifiers in Human Judgment Settings
- URL: http://arxiv.org/abs/2106.01254v2
- Date: Thu, 06 Nov 2025 16:52:50 GMT
- Title: Rater Equivalence: Evaluating Classifiers in Human Judgment Settings
- Authors: Paul Resnick, Yuqing Kong, Grant Schoenebeck, Tim Weninger
- Abstract summary: We introduce a framework for evaluating automated classifiers based solely on human judgments. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems.
- Score: 11.529701822081394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many decision settings, the definitive ground truth is either non-existent or inaccessible. In such cases, it is helpful to compare automated classifiers to human judgment. We introduce a framework for evaluating classifiers based solely on human judgments. We quantify a classifier's performance by its rater equivalence: the smallest number of human raters whose combined judgment matches the classifier's performance. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. We distinguish between two models of utility: one based on agreement with the assumed but inaccessible ground truth, and one based on matching individual human judgments. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems in practice.
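To make the rater-equivalence idea concrete, the sketch below estimates, on a toy dataset, the smallest panel size k at which majority votes of k randomly sampled human raters match or exceed the classifier's agreement with a benchmark label. The data layout, the majority-vote aggregation, and all names here are illustrative assumptions, not the authors' exact procedure.

```python
import random

# Toy data: for each item, binary labels from several human raters plus the
# classifier's prediction. All values and names here are illustrative assumptions.
item_labels = {
    "item_1": {"raters": [1, 1, 0, 1, 1, 0, 1], "classifier": 1},
    "item_2": {"raters": [0, 0, 0, 1, 0, 0, 1], "classifier": 0},
    "item_3": {"raters": [1, 0, 1, 1, 0, 1, 0], "classifier": 0},
}
items = list(item_labels.values())

def benchmark_label(raters):
    """Benchmark-panel label: majority vote over all available raters.
    (In the paper's setting the benchmark panel would be held out from the
    panels being scored; this toy sketch skips that separation.)"""
    return int(sum(raters) > len(raters) / 2)

def panel_agreement(k, trials=2000, seed=0):
    """Average agreement between a random k-rater majority vote and the benchmark label."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        item = rng.choice(items)
        votes = rng.sample(item["raters"], k)
        panel_label = int(sum(votes) > k / 2)  # ties (even k) break toward 0 here
        hits += int(panel_label == benchmark_label(item["raters"]))
    return hits / trials

def classifier_agreement():
    """Agreement between the classifier and the benchmark label."""
    return sum(
        int(it["classifier"] == benchmark_label(it["raters"])) for it in items
    ) / len(items)

clf_acc = classifier_agreement()

# Rater equivalence (toy version): smallest panel size whose agreement with the
# benchmark reaches the classifier's agreement.
for k in (1, 3, 5):
    if panel_agreement(k) >= clf_acc:
        print(f"rater equivalence ~ {k} (classifier agreement = {clf_acc:.2f})")
        break
else:
    print(f"classifier outperforms all tested panel sizes (agreement = {clf_acc:.2f})")
```

Roughly, the abstract's first utility model would score panels and the classifier against an assumed ground-truth label, while the second would score them against individual held-out human judgments; the toy benchmark vote above stands in for either.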
Related papers
- EigenBench: A Comparative Behavioral Measure of Value Alignment [0.28707625120094377]
EigenBench is a black-box method for benchmarking language models' values. It is designed to quantify subjective traits for which reasonable judges may disagree on the correct label. It can recover model rankings on the GPQA benchmark without access to objective labels.
arXiv Detail & Related papers (2025-09-02T04:14:26Z) - SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection. Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains. We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
arXiv Detail & Related papers (2025-03-05T09:37:05Z) - HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF). In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - The Tile: A 2D Map of Ranking Scores for Two-Class Classification [10.89980029564174]
We present a novel versatile tool, named the Tile, that organizes an infinity of ranking scores in a single 2D map for two-class classifiers.
We study the properties of the underlying ranking scores, such as the influence of the priors or the correspondences with the ROC space.
arXiv Detail & Related papers (2024-12-05T16:27:59Z) - Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge [51.93909886542317]
We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human labels and those from automatic evaluation.
We propose stratifying data by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z) - Compare without Despair: Reliable Preference Evaluation with Generation Separability [20.50638483427141]
We introduce a measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation.
For a candidate test instance, separability samples multiple generations from a pair of models, and measures how distinguishable the two sets of generations are.
Experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters.
arXiv Detail & Related papers (2024-07-02T01:37:56Z) - A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
arXiv Detail & Related papers (2024-01-26T14:21:45Z) - Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks [9.110872603799839]
Supervised classification heavily depends on datasets annotated by humans.
In subjective tasks such as toxicity classification, these annotations often exhibit low agreement among raters.
In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks.
arXiv Detail & Related papers (2023-11-16T10:18:32Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z) - SeedBERT: Recovering Annotator Rating Distributions from an Aggregated Label [43.23903984174963]
We propose SeedBERT, a method for recovering annotator rating distributions from a single label.
Our human evaluations indicate that SeedBERT's attention mechanism is consistent with human sources of annotator disagreement.
arXiv Detail & Related papers (2022-11-23T18:35:15Z) - Language Model Classifier Aligns Better with Physician Word Sensitivity than XGBoost on Readmission Prediction [86.15787587540132]
We introduce sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level.
Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores.
arXiv Detail & Related papers (2022-11-13T23:59:11Z) - Towards Human-Centred Explainability Benchmarks For Text Classification [4.393754160527062]
We propose to extend text classification benchmarks to evaluate the explainability of text classifiers.
We review challenges associated with objectively evaluating the capabilities to produce valid explanations.
We propose to ground these benchmarks in human-centred applications.
arXiv Detail & Related papers (2022-11-10T09:52:31Z) - Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values [73.82043713141142]
Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values.
We introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command.
arXiv Detail & Related papers (2022-10-14T09:10:49Z) - Estimating Confidence of Predictions of Individual Classifiers and Their Ensembles for the Genre Classification Task [0.0]
Genre identification is a subclass of non-topical text classification.
Neural models based on pre-trained transformers, such as BERT or XLM-RoBERTa, demonstrate SOTA results in many NLP tasks.
arXiv Detail & Related papers (2022-06-15T09:59:05Z) - Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z) - SEPP: Similarity Estimation of Predicted Probabilities for Defending and Detecting Adversarial Text [0.0]
We propose an ensemble model based on similarity estimation of predicted probabilities (SEPP) to exploit the large gaps in the misclassified predictions.
We demonstrate the resilience of SEPP in defending and detecting adversarial texts through different types of victim classifiers.
arXiv Detail & Related papers (2021-10-12T05:36:54Z) - Specialists Outperform Generalists in Ensemble Classification [15.315432841707736]
In this paper, we address the question of whether we can determine the accuracy of the ensemble.
We explicitly construct the individual classifiers that attain the upper and lower bounds: specialists and generalists.
arXiv Detail & Related papers (2021-07-09T12:16:10Z) - Enriching ImageNet with Human Similarity Judgments and Psychological Embeddings [7.6146285961466]
We introduce a dataset that embodies the task-general capabilities of human perception and reasoning.
The Human Similarity Judgments extension to ImageNet (ImageNet-HSJ) is composed of human similarity judgments.
The new dataset supports a range of task and performance metrics, including the evaluation of unsupervised learning algorithms.
arXiv Detail & Related papers (2020-11-22T13:41:54Z) - Learning and Evaluating Representations for Deep One-class Classification [59.095144932794646]
We present a two-stage framework for deep one-class classification.
We first learn self-supervised representations from one-class data, and then build one-class classifiers on learned representations.
In experiments, we demonstrate state-of-the-art performance on visual domain one-class classification benchmarks.
arXiv Detail & Related papers (2020-11-04T23:33:41Z) - Tweet Sentiment Quantification: An Experimental Re-Evaluation [88.60021378715636]
Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called "prevalence") of sentiment-related classes.
We re-evaluate those quantification methods following a now consolidated and much more robust experimental protocol.
Results are dramatically different from those obtained by Gao and Sebastiani, and they provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods. A toy sketch of the basic quantification setup appears after this list.
arXiv Detail & Related papers (2020-11-04T21:41:34Z) - Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application [63.10266319378212]
We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT).
We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
arXiv Detail & Related papers (2020-09-22T02:15:05Z)
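For the sentiment quantification entry above, here is a minimal, hypothetical sketch of two classic baselines in that literature, classify-and-count and its adjusted variant. The function names and toy numbers are assumptions for illustration, not the specific methods re-evaluated in that paper.

```python
from typing import List

def classify_and_count(predictions: List[int]) -> float:
    """Classify-and-count (CC): estimated prevalence = fraction of items predicted positive."""
    return sum(predictions) / len(predictions)

def adjusted_classify_and_count(predictions: List[int], tpr: float, fpr: float) -> float:
    """Adjusted classify-and-count (ACC): correct CC using the classifier's
    true-positive and false-positive rates estimated on held-out labeled data."""
    cc = classify_and_count(predictions)
    if tpr == fpr:  # degenerate classifier; the adjustment is undefined
        return cc
    # Invert E[cc] = tpr * p + fpr * (1 - p) for p, then clip to [0, 1].
    return min(1.0, max(0.0, (cc - fpr) / (tpr - fpr)))

# Toy usage with hypothetical numbers.
preds = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # classifier outputs on unlabeled tweets
print(classify_and_count(preds))                              # 0.6
print(adjusted_classify_and_count(preds, tpr=0.8, fpr=0.2))   # ~0.667
```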
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.