"All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations
- URL: http://arxiv.org/abs/2411.15634v1
- Date: Sat, 23 Nov 2024 19:18:08 GMT
- Title: "All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations
- Authors: Michael Hardy
- Abstract summary: "Gold" and "ground truth" human-mediated labels have error.
This study demonstrates methods for answering questions of accuracy, bias, fairness, and usefulness during model evaluation, even when labels from expert humans have very low reliability.
- Abstract: "Gold" and "ground truth" human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently only human task. We answer the question of whether such a task can be automated using two Large Language Model (LLM) architecture families, encoders and GPT decoders, using novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieves state-of-the-art, even "super-human", results across all classroom annotation tasks. However, not all of these positive results survive more rigorous evaluation measures, which reveal spurious correlations and nonrandom racial biases across models and humans. This study then expands these methods to estimate how model use would change human label quality if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of expensive human ratings of classroom instruction.
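The abstract's point that standard metrics can mask label and model quality can be illustrated with a small sketch. This is not the paper's method and does not reproduce any of its six evaluation dimensions; it only contrasts raw agreement with Cohen's kappa, a common chance-corrected agreement statistic. The ratings below are hypothetical toy data invented for illustration, and the function names `raw_agreement` and `cohen_kappa` are assumptions of this sketch.

```python
from collections import Counter


def raw_agreement(a, b):
    """Fraction of items on which two label sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    observed = raw_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance if each rater labeled independently
    # according to their own marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        # Returning 1.0 is a convention chosen for this sketch only.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary ratings (1 = "high quality"), invented for illustration.
human_1 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
human_2 = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
model = [1] * 10  # a model that always predicts the majority label

print("model vs human_1 agreement:", raw_agreement(model, human_1))
print("model vs human_1 kappa:", round(cohen_kappa(model, human_1), 3))
print("human_1 vs human_2 agreement:", raw_agreement(human_1, human_2))
print("human_1 vs human_2 kappa:", round(cohen_kappa(human_1, human_2), 3))
```

In this toy data the constant model matches `human_1` on 80% of items yet has a kappa of 0, while the two humans also agree on 80% of items with a kappa of about 0.375. Raw agreement alone would not distinguish an uninformative model from a noisy human rater.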
Related papers
- Balancing Label Quantity and Quality for Scalable Elicitation [2.2143065226946423]
We study the microeconomics of the quantity-quality tradeoff on binary NLP classification tasks.
We observe three regimes of eliciting classification knowledge from pretrained models using supervised finetuning.
We find that the accuracy of supervised fine-tuning can be improved by up to 5 percentage points at a fixed labeling budget.
arXiv Detail & Related papers (2024-10-17T04:39:58Z) - Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge [51.93909886542317]
We show how a single aggregate correlation score can obscure differences between human behavior and automatic evaluation methods.
We propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z) - Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z) - Estimating label quality and errors in semantic segmentation data via any model [19.84626033109009]
We study methods to score label quality, such that the images with the lowest scores are least likely to be correctly labeled.
This helps prioritize what data to review in order to ensure a high-quality training/evaluation dataset.
arXiv Detail & Related papers (2023-07-11T07:29:09Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels [61.97359362447732]
Learning from noisy labels is an important and long-standing problem in machine learning for real applications.
In this paper, we reformulate the label-noise problem from a generative-model perspective.
Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets.
arXiv Detail & Related papers (2023-05-31T03:01:36Z) - Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z) - Confidence-Aware Calibration and Scoring Functions for Curriculum Learning [1.192436948211501]
We integrate notions of model confidence and human confidence with label smoothing to achieve better model calibration and generalization.
A higher model or human confidence score indicates a more recognisable and therefore easier sample, and can therefore be used as a scoring function to rank samples in curriculum learning.
arXiv Detail & Related papers (2023-01-29T23:59:40Z) - SeedBERT: Recovering Annotator Rating Distributions from an Aggregated Label [43.23903984174963]
We propose SeedBERT, a method for recovering annotator rating distributions from a single label.
Our human evaluations indicate that SeedBERT's attention mechanism is consistent with human sources of annotator disagreement.
arXiv Detail & Related papers (2022-11-23T18:35:15Z) - An Empirical Investigation of Learning from Biased Toxicity Labels [15.822714574671412]
We study how different training strategies can leverage a small dataset of human-annotated labels and a large but noisy dataset of synthetically generated labels.
We evaluate the accuracy and fairness properties of these approaches, and trade-offs between the two.
arXiv Detail & Related papers (2021-10-04T17:19:57Z) - Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets [90.61266099147053]
We investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images.
We propose modifications and best practices aimed at minimizing human labeling effort.
Simulated experiments on a 125k image subset of the ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average.
arXiv Detail & Related papers (2021-04-26T16:29:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.