Which Prompts Make The Difference? Data Prioritization For Efficient
Human LLM Evaluation
- URL: http://arxiv.org/abs/2310.14424v1
- Date: Sun, 22 Oct 2023 21:48:51 GMT
- Title: Which Prompts Make The Difference? Data Prioritization For Efficient
Human LLM Evaluation
- Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker
- Abstract summary: We find that metric-based methods enhance the efficiency of human evaluations by minimizing the number of required annotations.
We show that our method is effective across widely used model families, reducing instances of indecisive (or "tie") outcomes by up to 54%.
This potential reduction in required human effort positions our approach as a valuable strategy in future large language model evaluations.
- Score: 9.452326973655445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human evaluation is increasingly critical for assessing large language
models, capturing linguistic nuances, and reflecting user preferences more
accurately than traditional automated metrics. However, the resource-intensive
nature of this type of annotation process poses significant challenges. The key
question driving our work: "is it feasible to minimize human-in-the-loop
feedback by prioritizing data instances which most effectively distinguish
between models?" We evaluate several metric-based methods and find that these
metrics enhance the efficiency of human evaluations by minimizing the number of
required annotations, thus saving time and cost, while ensuring a robust
performance evaluation. We show that our method is effective across widely used
model families, reducing instances of indecisive (or "tie") outcomes by up to
54% compared to a random sample when focusing on the top-20 percentile of
prioritized instances. This potential reduction in required human effort
positions our approach as a valuable strategy in future large language model
evaluations.
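To make the prioritization idea concrete, below is a minimal, illustrative sketch (not the paper's implementation): it scores each prompt by how strongly two models' responses disagree under an automatic proxy metric and sends only the top percentile to human annotators. The token-overlap `dissimilarity` function, the `prioritize_prompts` helper, and the default top-20 cutoff are assumptions for illustration; the paper evaluates its own set of metric-based methods.

```python
# Illustrative sketch: metric-based prioritization of prompts for human evaluation.
# Assumption: a simple token-overlap distance stands in for whichever automatic
# metric is used to decide which prompts best distinguish two models.

def dissimilarity(response_a: str, response_b: str) -> float:
    """Proxy disagreement score: 1 - Jaccard overlap of the two responses' tokens."""
    tokens_a, tokens_b = set(response_a.split()), set(response_b.split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def prioritize_prompts(prompts, responses_a, responses_b, top_percentile=20):
    """Rank prompts by how strongly the two models' responses differ and keep
    the top `top_percentile` percent for human preference annotation."""
    scored = sorted(
        zip(prompts, responses_a, responses_b),
        key=lambda triple: dissimilarity(triple[1], triple[2]),
        reverse=True,
    )
    keep = max(1, len(scored) * top_percentile // 100)
    return [prompt for prompt, _, _ in scored[:keep]]


# Usage: annotate only the prioritized subset instead of a random sample.
# subset = prioritize_prompts(prompts, model_a_outputs, model_b_outputs)
```

The intuition is that prompts where both models produce near-identical responses are likely to yield "tie" judgments, so annotation effort is better spent on high-disagreement prompts.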
Related papers
- Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments [0.7852714805965528]
We develop a set of 30 counterfactual scenarios and collect ratings across 8 evaluation metrics from 206 respondents.
We fine-tune different Large Language Models to predict average or individual human judgments across these metrics.
arXiv Detail & Related papers (2024-10-28T15:33:37Z)
- Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align the preferences of Large Language Model evaluators with human evaluations.
We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer.
Our recalibration significantly improves the alignment of the LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
- HARE: HumAn pRiors, a key to small language model Efficiency [6.253561984966316]
Human priors play a crucial role in efficiently utilizing data in deep learning.
Existing Small Language Models mainly rely on web-scraped large-scale training data.
We propose a principle to leverage human priors for data construction.
arXiv Detail & Related papers (2024-06-17T10:56:03Z)
- Annotator-Centric Active Learning for Subjective NLP Tasks [7.766754308448708]
Active Learning (AL) addresses the high costs of collecting human annotations by strategically annotating the most informative samples.
We introduce Annotator-Centric Active Learning (ACAL), which incorporates an annotator selection strategy following data sampling.
Our objective is to efficiently approximate the full diversity of human judgments, and to assess model performance using annotator-centric metrics.
arXiv Detail & Related papers (2024-04-24T08:13:02Z)
- Persona-DB: Efficient Large Language Model Personalization for Response Prediction with Collaborative Data Refinement [79.2400720115588]
We introduce Persona-DB, a simple yet effective framework consisting of a hierarchical construction process to improve generalization across task contexts.
In the evaluation of response prediction, Persona-DB demonstrates superior context efficiency in maintaining accuracy with a significantly reduced retrieval size.
Our experiments also indicate a marked improvement of over 10% under cold-start scenarios, when users have extremely sparse data.
arXiv Detail & Related papers (2024-02-16T20:20:43Z)
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)