It HAS to be Subjective: Human Annotator Simulation via Zero-shot
Density Estimation
- URL: http://arxiv.org/abs/2310.00486v1
- Date: Sat, 30 Sep 2023 20:54:59 GMT
- Title: It HAS to be Subjective: Human Annotator Simulation via Zero-shot
Density Estimation
- Authors: Wen Wu, Wenlin Chen, Chao Zhang, Philip C. Woodland
- Abstract summary: Human annotator simulation (HAS) serves as a cost-effective substitute for human evaluation such as data annotation and system assessment.
Human perception and behaviour during human evaluation exhibit inherent variability due to diverse cognitive processes and subjective interpretations.
This paper introduces a novel meta-learning framework that treats HAS as a zero-shot density estimation problem.
- Score: 15.8765167340819
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human annotator simulation (HAS) serves as a cost-effective substitute for
human evaluation such as data annotation and system assessment. Human
perception and behaviour during human evaluation exhibit inherent variability
due to diverse cognitive processes and subjective interpretations, which should
be taken into account in modelling to better mimic the way people perceive and
interact with the world. This paper introduces a novel meta-learning framework
that treats HAS as a zero-shot density estimation problem, which incorporates
human variability and allows for the efficient generation of human-like
annotations for unlabelled test inputs. Under this framework, we propose two
new model classes, conditional integer flows and conditional softmax flows, to
account for ordinal and categorical annotations, respectively. The proposed
method is evaluated on three real-world human evaluation tasks and shows
superior capability and efficiency to predict the aggregated behaviours of
human annotators, match the distribution of human annotations, and simulate the
inter-annotator disagreements.
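The core idea can be illustrated with a minimal sketch: instead of predicting a single "gold" label per input, a simulator predicts a conditional distribution over labels and draws one sample per simulated annotator, so that both the aggregated behaviour and the inter-annotator disagreement are reproduced. The code below is an illustrative toy, not the paper's actual conditional softmax flow; the 4-class logits and annotator count are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def simulate_annotators(logits, n_annotators, rng):
    """Draw one categorical label per simulated annotator.

    `logits` holds unnormalised scores for one test input. Sampling
    from the full distribution (rather than taking the argmax) is
    what preserves inter-annotator disagreement.
    """
    p = softmax(logits)
    return rng.choice(len(p), size=n_annotators, p=p)

# Hypothetical 4-class labelling task whose annotation
# distribution is spread over several classes.
logits = np.array([2.0, 1.5, 0.2, -1.0])
labels = simulate_annotators(logits, n_annotators=1000, rng=rng)

# The empirical label distribution approaches the predicted one,
# so aggregated behaviour and disagreement are both captured.
empirical = np.bincount(labels, minlength=4) / len(labels)
print(np.round(empirical, 2))
print(np.round(softmax(logits), 2))
```

In the paper's setting the distribution is produced by a learned conditional flow rather than fixed logits, and the integer-flow variant handles ordinal (rather than categorical) annotations.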
Related papers
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]

We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information with truthfulness, and how cognitive uncertainty affects the reliability of rating scales such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of six pillars: Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
- Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks [1.3309842610191835]
"Human interaction evaluations" focus on the assessment of human-model interactions.
We propose a safety-focused HIE design framework with three stages.
We conclude with tangible recommendations for addressing concerns over costs, replicability, and unrepresentativeness of HIEs.
arXiv Detail & Related papers (2024-05-17T08:49:34Z)
- Offline Risk-sensitive RL with Partial Observability to Enhance Performance in Human-Robot Teaming [1.3980986259786223]
We propose a method to incorporate model uncertainty, thus enabling risk-sensitive sequential decision-making.
Experiments were conducted with a group of twenty-six human participants within a simulated robot teleoperation environment.
arXiv Detail & Related papers (2024-02-08T14:27:34Z)
- AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model [69.12623428463573]
AlignDiff is a novel framework that quantifies human preferences, covering their abstractness, and uses them to guide diffusion planning.
It can accurately match user-customized behaviors and efficiently switch from one to another.
We demonstrate its superior performance on preference matching, switching, and covering compared to other baselines.
arXiv Detail & Related papers (2023-10-03T13:53:08Z)
- Dataset Bias in Human Activity Recognition [57.91018542715725]
This contribution statistically curates the training data to assess the degree to which the physical characteristics of humans influence HAR performance.
We evaluate the performance of a state-of-the-art convolutional neural network on two time-series HAR datasets that vary in their sensors, activities, and recording conditions.
arXiv Detail & Related papers (2023-01-19T12:33:50Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
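The flavour of such a simulation can be conveyed with a toy sketch: judgements for a pairwise model comparison arrive one at a time, and the study stops early once one model has a clear lead, so the number of annotations actually needed can be measured. The function name, win probability, and stopping rule below are all illustrative assumptions, not the framework proposed in that paper.

```python
import random

def simulate_comparison_study(p_a_wins, max_annotations, lead_to_stop, seed=0):
    """Toy sequential annotation study.

    Collects pairwise judgements one at a time (model A preferred
    with probability `p_a_wins`) and stops early once one model
    leads by `lead_to_stop` votes. Returns (winner, annotations_used).
    """
    rng = random.Random(seed)
    lead = 0  # +1 per win for model A, -1 per win for model B
    for n in range(1, max_annotations + 1):
        lead += 1 if rng.random() < p_a_wins else -1
        if abs(lead) >= lead_to_stop:
            return ("A" if lead > 0 else "B", n)
    return ("A" if lead > 0 else "B" if lead < 0 else "tie", max_annotations)

winner, used = simulate_comparison_study(
    p_a_wins=0.65, max_annotations=500, lead_to_stop=10)
print(winner, used)
```

Running the same simulation across a range of win probabilities shows how many annotations a relative comparison needs before the better model can be declared with confidence.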
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods [6.232071870655069]
We show that theoretical measures used to score explainability methods poorly reflect the practical usefulness of individual attribution methods in real-world scenarios.
Our results suggest a critical need to develop better explainability methods and to deploy human-centered evaluation approaches.
arXiv Detail & Related papers (2021-12-06T18:36:09Z)
- Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence [62.826466543958624]
We look at the standardization gap and the validation gap in topic model evaluation.
Recent models relying on neural components surpass classical topic models according to these metrics.
We use automatic coherence along with the two most widely accepted human judgment tasks, namely, topic rating and word intrusion.
arXiv Detail & Related papers (2021-07-05T17:58:52Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.