Which Prompts Make The Difference? Data Prioritization For Efficient
Human LLM Evaluation
- URL: http://arxiv.org/abs/2310.14424v1
- Date: Sun, 22 Oct 2023 21:48:51 GMT
- Title: Which Prompts Make The Difference? Data Prioritization For Efficient
Human LLM Evaluation
- Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker
- Abstract summary: We find that metric-based methods enhance the efficiency of human evaluations by minimizing the number of required annotations.
We show that our method is effective across widely used model families, reducing instances of indecisive (or "tie") outcomes by up to 54%.
This potential reduction in required human effort positions our approach as a valuable strategy in future large language model evaluations.
- Score: 9.452326973655445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human evaluation is increasingly critical for assessing large language
models, capturing linguistic nuances, and reflecting user preferences more
accurately than traditional automated metrics. However, the resource-intensive
nature of this type of annotation process poses significant challenges. The key
question driving our work: "is it feasible to minimize human-in-the-loop
feedback by prioritizing data instances which most effectively distinguish
between models?" We evaluate several metric-based methods and find that these
metrics enhance the efficiency of human evaluations by minimizing the number of
required annotations, thus saving time and cost, while ensuring a robust
performance evaluation. We show that our method is effective across widely used
model families, reducing instances of indecisive (or "tie") outcomes by up to
54% compared to a random sample when focusing on the top-20 percentile of
prioritized instances. This potential reduction in required human effort
positions our approach as a valuable strategy in future large language model
evaluations.
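To make the prioritization idea concrete, below is a minimal, illustrative sketch (not the paper's implementation): it scores each prompt by how strongly two models' responses disagree under an automatic proxy metric and sends only the top percentile to human annotators. The token-overlap `dissimilarity` function, the `prioritize_prompts` helper, and the default top-20 cutoff are assumptions for illustration; the paper evaluates its own set of metric-based methods.

```python
# Illustrative sketch: metric-based prioritization of prompts for human evaluation.
# Assumption: a simple token-overlap distance stands in for whichever automatic
# metric is used to decide which prompts best distinguish two models.

def dissimilarity(response_a: str, response_b: str) -> float:
    """Proxy disagreement score: 1 - Jaccard overlap of the two responses' tokens."""
    tokens_a, tokens_b = set(response_a.split()), set(response_b.split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def prioritize_prompts(prompts, responses_a, responses_b, top_percentile=20):
    """Rank prompts by how strongly the two models' responses differ and keep
    the top `top_percentile` percent for human preference annotation."""
    scored = sorted(
        zip(prompts, responses_a, responses_b),
        key=lambda triple: dissimilarity(triple[1], triple[2]),
        reverse=True,
    )
    keep = max(1, len(scored) * top_percentile // 100)
    return [prompt for prompt, _, _ in scored[:keep]]


# Usage: annotate only the prioritized subset instead of a random sample.
# subset = prioritize_prompts(prompts, model_a_outputs, model_b_outputs)
```

The intuition is that prompts where both models produce near-identical responses are likely to yield "tie" judgments, so annotation effort is better spent on high-disagreement prompts.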
Related papers
- Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments [0.7852714805965528]
We develop a set of 30 counterfactual scenarios and collect ratings across 8 evaluation metrics from 206 respondents.
We fine-tune different Large Language Models to predict average or individual human judgments across these metrics.
arXiv Detail & Related papers (2024-10-28T15:33:37Z)
- Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align the preferences of Large Language Model evaluators with human evaluations.
We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer.
Our recalibration significantly improves the alignment of the LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
- HARE: HumAn pRiors, a key to small language model Efficiency [6.253561984966316]
Human priors play a crucial role in efficiently utilizing data in deep learning.
Existing Small Language Models mainly rely on web-scraped large-scale training data.
We propose a principle to leverage human priors for data construction.
arXiv Detail & Related papers (2024-06-17T10:56:03Z)
- Annotator-Centric Active Learning for Subjective NLP Tasks [7.766754308448708]
Active Learning (AL) addresses the high costs of collecting human annotations by strategically annotating the most informative samples.
We introduce Annotator-Centric Active Learning (ACAL), which incorporates an annotator selection strategy following data sampling.
Our objective is to efficiently approximate the full diversity of human judgments, and to assess model performance using annotator-centric metrics.
arXiv Detail & Related papers (2024-04-24T08:13:02Z)
- Persona-DB: Efficient Large Language Model Personalization for Response Prediction with Collaborative Data Refinement [79.2400720115588]
We introduce Persona-DB, a simple yet effective framework consisting of a hierarchical construction process to improve generalization across task contexts.
In the evaluation of response prediction, Persona-DB demonstrates superior context efficiency in maintaining accuracy with a significantly reduced retrieval size.
Our experiments also indicate a marked improvement of over 10% under cold-start scenarios, when users have extremely sparse data.
arXiv Detail & Related papers (2024-02-16T20:20:43Z)
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)