It HAS to be Subjective: Human Annotator Simulation via Zero-shot
Density Estimation
- URL: http://arxiv.org/abs/2310.00486v1
- Date: Sat, 30 Sep 2023 20:54:59 GMT
- Title: It HAS to be Subjective: Human Annotator Simulation via Zero-shot
Density Estimation
- Authors: Wen Wu, Wenlin Chen, Chao Zhang, Philip C. Woodland
- Abstract summary: Human annotator simulation (HAS) serves as a cost-effective substitute for human evaluation such as data annotation and system assessment.
Human perception and behaviour during human evaluation exhibit inherent variability due to diverse cognitive processes and subjective interpretations.
This paper introduces a novel meta-learning framework that treats HAS as a zero-shot density estimation problem.
- Score: 15.8765167340819
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human annotator simulation (HAS) serves as a cost-effective substitute for
human evaluation such as data annotation and system assessment. Human
perception and behaviour during human evaluation exhibit inherent variability
due to diverse cognitive processes and subjective interpretations, which should
be taken into account in modelling to better mimic the way people perceive and
interact with the world. This paper introduces a novel meta-learning framework
that treats HAS as a zero-shot density estimation problem, which incorporates
human variability and allows for the efficient generation of human-like
annotations for unlabelled test inputs. Under this framework, we propose two
new model classes, conditional integer flows and conditional softmax flows, to
account for ordinal and categorical annotations, respectively. The proposed
method is evaluated on three real-world human evaluation tasks and shows
superior capability and efficiency to predict the aggregated behaviours of
human annotators, match the distribution of human annotations, and simulate the
inter-annotator disagreements.
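Code sketch (illustrative only, not taken from the paper): the minimal Python example below shows the general idea of treating human annotator simulation as conditional density estimation, i.e. learning p(label | input) from multiple annotations per item and then sampling simulated annotators for an unseen input. It uses a plain softmax classifier trained on soft label distributions as a stand-in; the paper's conditional integer flows and conditional softmax flows are not reproduced here, and all dimensions and data are assumed placeholders.
```python
# Illustrative sketch only: NOT the paper's conditional integer/softmax flow
# implementation. It shows the general idea of treating human annotator
# simulation (HAS) as conditional density estimation: learn p(label | input)
# from multiple annotators per item, then sample "simulated annotators".
import torch
import torch.nn as nn

torch.manual_seed(0)

NUM_CLASSES = 5      # e.g. a 5-point annotation scheme (assumed)
FEATURE_DIM = 16     # input feature dimension (assumed)

# Toy data: 200 items, each rated by 10 annotators (synthetic placeholders).
features = torch.randn(200, FEATURE_DIM)
annotations = torch.randint(0, NUM_CLASSES, (200, 10))

# Empirical per-item label distribution used as a soft target.
soft_targets = torch.stack(
    [torch.bincount(row, minlength=NUM_CLASSES).float() / row.numel()
     for row in annotations]
)

# Conditional categorical model: p(label | input features).
model = nn.Sequential(
    nn.Linear(FEATURE_DIM, 64), nn.ReLU(),
    nn.Linear(64, NUM_CLASSES),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(500):
    log_probs = model(features).log_softmax(dim=-1)
    # Cross-entropy against the soft annotator distribution preserves the
    # inter-annotator variability instead of collapsing to a majority label.
    loss = -(soft_targets * log_probs).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Zero-shot simulation: sample 10 "annotators" for an unseen test input.
test_input = torch.randn(1, FEATURE_DIM)
probs = model(test_input).softmax(dim=-1)
simulated = torch.multinomial(probs, num_samples=10, replacement=True)
print("predicted label distribution:", probs.squeeze().tolist())
print("simulated annotations:", simulated.squeeze().tolist())
```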
Related papers
- Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments [0.7852714805965528]
We develop a set of 30 counterfactual scenarios and collect ratings across 8 evaluation metrics from 206 respondents.
We fine-tuned different Large Language Models to predict average or individual human judgment across these metrics.
arXiv Detail & Related papers (2024-10-28T15:33:37Z)
- Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge [51.93909886542317]
We show how a single aggregate correlation score can obscure differences between human behavior and automatic evaluation methods.
We propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance; an illustrative sketch of this kind of stratified analysis appears after this list.
arXiv Detail & Related papers (2024-10-03T03:08:29Z)
- Poor-Supervised Evaluation for SuperLLM via Mutual Consistency [20.138831477848615]
We propose the PoEM framework to conduct evaluation without accurate labels.
We first prove that, under certain conditions, the capability of a model can be equivalently assessed by its consistency with a reference model.
Since these conditions rarely hold exactly in practice, we introduce an algorithm that treats humans (when available) and the models under evaluation as reference models.
arXiv Detail & Related papers (2024-08-25T06:49:03Z)
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
- Offline Risk-sensitive RL with Partial Observability to Enhance Performance in Human-Robot Teaming [1.3980986259786223]
We propose a method to incorporate model uncertainty, thus enabling risk-sensitive sequential decision-making.
Experiments were conducted with a group of twenty-six human participants within a simulated robot teleoperation environment.
arXiv Detail & Related papers (2024-02-08T14:27:34Z)
- AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model [69.12623428463573]
AlignDiff is a novel framework that quantifies human preferences, accounting for their abstractness, and uses them to guide diffusion planning.
It can accurately match user-customized behaviors and efficiently switch from one to another.
We demonstrate its superior performance on preference matching, switching, and covering compared to other baselines.
arXiv Detail & Related papers (2023-10-03T13:53:08Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based human evaluation framework to assess multiple labelling strategies and methods for deciding the better model, in both a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
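Code sketch (illustrative only, not from any listed paper): the stratified analysis mentioned in the "Beyond correlation" entry above can be pictured roughly as below, where items are bucketed by inter-annotator disagreement before computing the metric-human correlation in each bucket. The uncertainty measure (per-item standard deviation), the tertile bucketing, and the synthetic data are all assumptions for illustration.
```python
# Illustrative sketch of stratifying meta-evaluation results by human label
# uncertainty (assumed setup, not the exact protocol of the cited paper).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Toy data: 300 items, 5 human ratings each, plus one automatic metric score.
human_ratings = rng.integers(1, 6, size=(300, 5)).astype(float)
metric_scores = human_ratings.mean(axis=1) + rng.normal(0, 0.7, size=300)

mean_human = human_ratings.mean(axis=1)
uncertainty = human_ratings.std(axis=1)  # inter-annotator disagreement proxy

# Split items into low / medium / high uncertainty strata by tertiles.
edges = np.quantile(uncertainty, [1 / 3, 2 / 3])
strata = np.digitize(uncertainty, edges)  # 0 = low, 1 = medium, 2 = high

for stratum, name in enumerate(["low", "medium", "high"]):
    mask = strata == stratum
    rho, _ = spearmanr(metric_scores[mask], mean_human[mask])
    print(f"{name:>6} uncertainty (n={int(mask.sum()):3d}): Spearman rho = {rho:.3f}")

# A single pooled correlation can hide how agreement with humans changes
# on the high-uncertainty stratum.
rho_all, _ = spearmanr(metric_scores, mean_human)
print(f"   all items (n={len(metric_scores):3d}): Spearman rho = {rho_all:.3f}")
```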
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.