It HAS to be Subjective: Human Annotator Simulation via Zero-shot
Density Estimation
- URL: http://arxiv.org/abs/2310.00486v1
- Date: Sat, 30 Sep 2023 20:54:59 GMT
- Title: It HAS to be Subjective: Human Annotator Simulation via Zero-shot
Density Estimation
- Authors: Wen Wu, Wenlin Chen, Chao Zhang, Philip C. Woodland
- Abstract summary: Human annotator simulation (HAS) serves as a cost-effective substitute for human evaluation such as data annotation and system assessment.
Human perception and behaviour during human evaluation exhibit inherent variability due to diverse cognitive processes and subjective interpretations.
This paper introduces a novel meta-learning framework that treats HAS as a zero-shot density estimation problem.
- Score: 15.8765167340819
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human annotator simulation (HAS) serves as a cost-effective substitute for
human evaluation such as data annotation and system assessment. Human
perception and behaviour during human evaluation exhibit inherent variability
due to diverse cognitive processes and subjective interpretations, which should
be taken into account in modelling to better mimic the way people perceive and
interact with the world. This paper introduces a novel meta-learning framework
that treats HAS as a zero-shot density estimation problem, which incorporates
human variability and allows for the efficient generation of human-like
annotations for unlabelled test inputs. Under this framework, we propose two
new model classes, conditional integer flows and conditional softmax flows, to
account for ordinal and categorical annotations, respectively. The proposed
method is evaluated on three real-world human evaluation tasks and shows
superior capability and efficiency in predicting the aggregated behaviours of
human annotators, matching the distribution of human annotations, and
simulating inter-annotator disagreements.
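To make the framing above concrete, the sketch below illustrates the general idea of treating annotator simulation as conditional density estimation: predict a distribution over labels for each unlabelled input, then sample several simulated annotators from it. It is a toy stand-in, not the paper's conditional integer or softmax flows; the class name `ToyAnnotatorSimulator`, the linear logit model, and all shapes are illustrative assumptions.

```python
# Minimal sketch: annotator simulation as conditional density estimation.
# NOT the paper's conditional integer/softmax flows; purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class ToyAnnotatorSimulator:
    def __init__(self, n_features, n_classes):
        # Linear map from input features to class logits; a stand-in for a
        # learned conditional density model over categorical annotations.
        self.W = rng.normal(scale=0.1, size=(n_features, n_classes))

    def label_distribution(self, x):
        # Predicted distribution over labels for a single input x.
        return softmax(x @ self.W)

    def simulate_annotators(self, x, n_annotators=5):
        # Sample several labels to mimic inter-annotator variability,
        # rather than committing to one hard label per input.
        p = self.label_distribution(x)
        return rng.choice(len(p), size=n_annotators, p=p)

# Usage: simulate five annotators for one unlabelled test input.
sim = ToyAnnotatorSimulator(n_features=8, n_classes=3)
x_test = rng.normal(size=8)
print(sim.label_distribution(x_test))   # soft label distribution (sums to 1)
print(sim.simulate_annotators(x_test))  # five sampled human-like labels
```

Sampling many labels per input is what lets such a simulator reproduce aggregated behaviour, the label distribution, and inter-annotator disagreement, the three quantities the paper evaluates.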
Related papers
- Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments [0.7852714805965528]
We develop a set of 30 counterfactual scenarios and collect ratings across 8 evaluation metrics from 206 respondents.
We fine-tuned different Large Language Models to predict average or individual human judgment across these metrics.
arXiv Detail & Related papers (2024-10-28T15:33:37Z)
- Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge [51.93909886542317]
We show how relying on a single aggregate correlation score can obscure fundamental differences between human behavior and automatic evaluation methods.
We propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance (a toy sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-10-03T03:08:29Z)
- Poor-Supervised Evaluation for SuperLLM via Mutual Consistency [20.138831477848615]
We propose the PoEM framework to conduct evaluation without accurate labels.
We first prove that the capability of a model can be equivalently assessed by its consistency with a certain reference model.
To alleviate the insufficiency of these conditions in practice, we introduce an algorithm that treats humans (when available) and the models under evaluation as reference models.
arXiv Detail & Related papers (2024-08-25T06:49:03Z)
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
- Offline Risk-sensitive RL with Partial Observability to Enhance Performance in Human-Robot Teaming [1.3980986259786223]
We propose a method to incorporate model uncertainty, thus enabling risk-sensitive sequential decision-making.
Experiments were conducted with a group of twenty-six human participants within a simulated robot teleoperation environment.
arXiv Detail & Related papers (2024-02-08T14:27:34Z)
- AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model [69.12623428463573]
AlignDiff is a novel framework that quantifies human preferences, covering their abstractness, and uses them to guide diffusion planning.
It can accurately match user-customized behaviors and efficiently switch from one to another.
We demonstrate its superior performance on preference matching, switching, and covering compared to other baselines.
arXiv Detail & Related papers (2023-10-03T13:53:08Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods for deciding the better model, in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods [6.232071870655069]
We show that theoretical measures used to score explainability methods poorly reflect the practical usefulness of individual attribution methods in real-world scenarios.
Our results suggest a critical need to develop better explainability methods and to deploy human-centered evaluation approaches.
arXiv Detail & Related papers (2021-12-06T18:36:09Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
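As a toy illustration of the stratification idea from the "Beyond correlation" entry above: compute the metric-human correlation separately for low- and high-uncertainty items instead of reporting a single aggregate score. The uncertainty proxy (per-item rating standard deviation), the median split, and the synthetic data are assumptions for illustration, not that paper's exact protocol.

```python
# Hypothetical sketch: stratify metric-vs-human correlation by label uncertainty.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: per-item ratings from 5 annotators plus one automatic metric.
human_ratings = rng.integers(1, 6, size=(200, 5)).astype(float)
metric_scores = human_ratings.mean(axis=1) + rng.normal(scale=0.7, size=200)

human_mean = human_ratings.mean(axis=1)
human_std = human_ratings.std(axis=1)  # proxy for human label uncertainty

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

# A single aggregate correlation vs. correlations within uncertainty strata.
print("overall:", round(pearson(human_mean, metric_scores), 3))
median_std = np.median(human_std)
for name, mask in [("low-uncertainty", human_std <= median_std),
                   ("high-uncertainty", human_std > median_std)]:
    print(name, round(pearson(human_mean[mask], metric_scores[mask]), 3))
```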