Using Natural Language Explanations to Rescale Human Judgments
- URL: http://arxiv.org/abs/2305.14770v5
- Date: Mon, 9 Sep 2024 17:37:16 GMT
- Title: Using Natural Language Explanations to Rescale Human Judgments
- Authors: Manya Wadhwa, Jifan Chen, Junyi Jessy Li, Greg Durrett
- Abstract summary: We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
- Score: 81.66697572357477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rise of large language models (LLMs) has brought a critical need for high-quality human-labeled data, particularly for processes like human feedback and evaluation. A common practice is to label data via consensus annotation over human judgments. However, annotators' judgments for subjective tasks can differ in many ways: they may reflect different qualitative judgments about an example, and they may be mapped to a labeling scheme in different ways. We show that these nuances can be captured by natural language explanations, and propose a method to rescale ordinal annotations and explanations using LLMs. Specifically, we feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. These scores should reflect the annotators' underlying assessments of the example. The rubric can be designed or modified after annotation, and include distinctions that may not have been known when the original error taxonomy was devised. We explore our technique in the context of rating system outputs for a document-grounded question answering task, where LLMs achieve near-human performance. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
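The core procedure described in the abstract can be pictured as a prompt-construction and score-parsing loop: the rubric, the annotator's Likert rating, and their free-text explanation go into a single prompt, and a rubric-anchored numeric score comes out. The sketch below is a minimal Python illustration of that idea; the rubric text, the 0-100 scale, and the `call_llm` helper are assumptions for the sake of the example, not the authors' exact prompt or API.

```python
import re

# Illustrative rubric; the paper's actual rubric is task-specific and may differ.
RUBRIC = """Score the answer from 0 to 100:
90-100: fully correct and complete, grounded in the document.
70-89:  mostly correct, with minor omissions or imprecision.
40-69:  partially correct, with notable errors or missing content.
0-39:   largely incorrect or unsupported by the document."""


def build_rescaling_prompt(question, system_answer, likert_rating, explanation):
    """Combine an annotator's Likert rating and free-text explanation with a
    scoring rubric, and ask the LLM for a single rubric-anchored score."""
    return (
        f"{RUBRIC}\n\n"
        f"Question: {question}\n"
        f"System answer: {system_answer}\n"
        f"Annotator's Likert rating (1-5): {likert_rating}\n"
        f"Annotator's explanation: {explanation}\n\n"
        "Based on the rating and the explanation, output a single integer "
        "between 0 and 100 that reflects the rubric. Score:"
    )


def parse_score(llm_output):
    """Extract the first integer from the LLM's response; None if absent."""
    match = re.search(r"\d+", llm_output)
    return int(match.group()) if match else None


# Usage, where call_llm is a placeholder for whatever LLM client is available:
# prompt = build_rescaling_prompt(q, answer, rating, explanation)
# score = parse_score(call_llm(prompt))
```

Because the rubric is only consulted at rescaling time, it can be revised after annotation without re-collecting judgments, which is the flexibility the abstract emphasizes.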
Related papers
- Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection [5.731621080995591]
Hate speech spreads widely online, harming individuals and communities, making automatic detection essential for large-scale moderation.
Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign.
Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement.
arXiv Detail & Related papers (2025-12-10T14:00:48Z) - Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks [8.246529401043128]
We aim to explore how labeling decisions -- by both humans and LLMs -- can be statistically evaluated across individuals.
We develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure.
We apply this approach to two datasets -- MovieLens 100K and PolitiFact -- and find that the LLM is statistically indistinguishable from a human annotator in the former.
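A rough picture of that procedure: compute Krippendorff's alpha with the LLM added to the annotator pool, compute it again with a held-out human added, bootstrap the difference over units, and test whether the difference is small enough to call the two equivalent. The sketch below assumes the `krippendorff` PyPI package and substitutes a bootstrap-percentile equivalence interval for the t-test-based TOST to keep it self-contained; the equivalence margin is an assumed choice, not the paper's.

```python
import numpy as np
import krippendorff  # pip install krippendorff


def alpha_with_extra(ratings, extra):
    """Krippendorff's alpha for a (units x annotators) rating matrix with one
    extra annotator column appended (e.g. the LLM or a held-out human)."""
    data = np.column_stack([ratings, extra])
    # The krippendorff package expects annotators as rows and units as columns.
    return krippendorff.alpha(reliability_data=data.T,
                              level_of_measurement="ordinal")


def bootstrap_alpha_diffs(ratings, human, llm, n_boot=1000, seed=0):
    """Paired bootstrap over units: for each resample, the difference between
    alpha with the LLM added and alpha with the held-out human added."""
    rng = np.random.default_rng(seed)
    n = ratings.shape[0]
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample units with replacement
        diffs[b] = (alpha_with_extra(ratings[idx], llm[idx])
                    - alpha_with_extra(ratings[idx], human[idx]))
    return diffs


def equivalent(diffs, margin=0.05, level=0.05):
    """Bootstrap-percentile equivalence check: treat the LLM and the human as
    interchangeable if the (1 - 2*level) interval of the alpha difference lies
    entirely inside [-margin, +margin]. The margin is an assumed choice."""
    lo, hi = np.percentile(diffs, [100 * level, 100 * (1 - level)])
    return (lo > -margin) and (hi < margin)
```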
arXiv Detail & Related papers (2025-10-08T05:17:33Z) - Human-Centered Evaluation of RAG outputs: a framework and questionnaire for human-AI collaboration [0.0]
We designed a human-centred questionnaire that assesses RAG outputs across 12 dimensions.
We incorporated feedback from both a human rater and a human-LLM pair.
Results indicate that while large language models (LLMs) reliably focus on metric descriptions and scale labels, they exhibit weaknesses in detecting textual format variations.
arXiv Detail & Related papers (2025-09-30T13:08:33Z) - Bridging the Gap: In-Context Learning for Modeling Human Disagreement [8.011316959982654]
Large Language Models (LLMs) have shown strong performance on NLP classification tasks.
This study examines whether LLMs can capture multiple perspectives and reflect annotator disagreement in subjective tasks such as hate speech and offensive language detection.
arXiv Detail & Related papers (2025-06-06T14:24:29Z) - Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation [60.18907916989796]
Large Language Models (LLMs) generate chains of thought (CoTs) before giving the final answer.
We propose a novel pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option.
We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores.
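One way to read "prioritizes the ranking of answers over exact scores" is to compare, item by item, how a model and the human annotator pool order the answer options. The sketch below uses Spearman correlation as the ranking measure; that choice, and the per-item averaging, are assumptions for illustration, not necessarily the paper's exact framework.

```python
import numpy as np
from scipy.stats import spearmanr


def rank_based_score(model_scores, human_scores):
    """Per-item Spearman correlation between model and human scores over the
    answer options, averaged across items. Rewards getting the ordering of
    options right even when the exact score values differ."""
    corrs = []
    for m, h in zip(model_scores, human_scores):
        rho, _ = spearmanr(m, h)
        if not np.isnan(rho):  # skip degenerate items with constant scores
            corrs.append(rho)
    return float(np.mean(corrs)) if corrs else float("nan")


# Usage: each inner list holds one item's scores over its answer options.
# model = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
# human = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
# print(rank_based_score(model, human))
```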
arXiv Detail & Related papers (2025-05-29T11:47:18Z) - SedarEval: Automated Evaluation using Self-Adaptive Rubrics [4.97150240417381]
We propose a new evaluation paradigm based on self-adaptive rubrics.
SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric.
We train a specialized evaluator language model (evaluator LM) to supplant human graders.
arXiv Detail & Related papers (2025-01-26T16:45:09Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks.
We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements are in opposition to standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z) - Comparing zero-shot self-explanations with human rationales in multilingual text classification [5.32539007352208]
Instruction-tuned LLMs generate self-explanations that do not require additional computation or the application of possibly complex XAI methods.
We analyse whether this ability results in a good explanation by evaluating self-explanations in the form of input rationales.
Our results show that self-explanations align more closely with human annotations compared to LRP, while maintaining a comparable level of faithfulness.
arXiv Detail & Related papers (2024-10-04T10:14:12Z) - Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions [18.93335792080899]
We investigate how much the prompting of LLMs-as-a-judge influences the alignment of AI judgements with human judgements.
We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges.
arXiv Detail & Related papers (2024-08-16T14:49:35Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z) - Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks [9.110872603799839]
Supervised classification heavily depends on datasets annotated by humans.
In subjective tasks such as toxicity classification, these annotations often exhibit low agreement among raters.
In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks.
arXiv Detail & Related papers (2023-11-16T10:18:32Z) - Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models [32.843361525236965]
We analyze the effect of sparse feedback on the alignment and evaluation of large language models.
We find that preferences derived from ratings and those derived from rankings disagree significantly, about 60% of the time, for both human and AI annotators.
Our findings shed light on critical gaps in methods for evaluating the real-world utility of language models.
arXiv Detail & Related papers (2023-08-30T07:35:32Z) - Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
It has been claimed that large language models (LLMs) can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z) - Re-Examining Human Annotations for Interpretable NLP [80.81532239566992]
We conduct controlled experiments on crowdsourcing websites using two widely used datasets in Interpretable NLP.
We compare the annotation results obtained from recruiting workers satisfying different levels of qualification.
Our results reveal that annotation quality depends heavily on the workers' qualifications, and that workers can be steered toward particular annotations by the instructions.
arXiv Detail & Related papers (2022-04-10T02:27:30Z) - Elephant in the Room: An Evaluation Framework for Assessing Adversarial Examples in NLP [24.661335236627053]
An adversarial example is an input transformed by small perturbations that machine learning models consistently misclassify.
We propose an evaluation framework consisting of automatic evaluation metrics and human evaluation guidelines.
arXiv Detail & Related papers (2020-01-22T00:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.