Accelerating Unbiased LLM Evaluation via Synthetic Feedback
- URL: http://arxiv.org/abs/2502.10563v1
- Date: Fri, 14 Feb 2025 21:27:09 GMT
- Title: Accelerating Unbiased LLM Evaluation via Synthetic Feedback
- Authors: Zhaoyi Zhou, Yuda Song, Andrea Zanette
- Abstract summary: We propose a statistically principled framework that integrates human and synthetic feedback to reduce reliance on human annotations.
Our experiments demonstrate a reduction in human annotations by up to 12.2% with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned variant.
- Score: 17.597195550638343
- Abstract: When developing new large language models (LLMs), a key step is evaluating their final performance, often by computing the win-rate against a reference model based on external feedback. Human feedback is the gold standard, particularly for capturing nuanced qualities like coherence, readability, and alignment with human expectations. However, human evaluations are costly -- even for large tech companies -- and when conducted with active users, they may negatively impact user experience. A promising alternative is synthetic feedback, where evaluations are conducted by other large language models, including reward models. While this eliminates the need for costly human annotations, it introduces biases that may distort the evaluation process. In this work, we propose a statistically principled framework that integrates human and synthetic feedback to reduce reliance on human annotations while maintaining unbiased win-rate calculations. Our experiments demonstrate a reduction in human annotations by up to 12.2% with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned variant. Apart from being generalizable, scalable, and free of hyper-parameter tuning, our method offers predictable annotation savings, which can be estimated based on data-dependent characteristics.
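The central idea, keeping the win-rate estimate anchored to human judgments while letting cheap synthetic judgments carry most of the statistical weight, can be illustrated with a control-variate style estimator. The sketch below is an assumption-laden illustration, not the paper's exact method; the function name `combined_win_rate` and the toy data are hypothetical.

```python
# Minimal sketch (assumed, not the paper's estimator): estimate the win-rate
# against a reference model from a small human-labelled subset plus a large
# pool of synthetic (LLM-judge) labels. The correction term keeps the
# estimate unbiased for the human win-rate even if the synthetic judge is biased.
import numpy as np

def combined_win_rate(human_labels, synth_on_human, synth_all):
    """human_labels   : 1/0 human judgments on a small labelled subset
    synth_on_human : synthetic judgments on that same subset
    synth_all      : synthetic judgments on the full, cheaply labelled set"""
    human_labels = np.asarray(human_labels, dtype=float)
    synth_on_human = np.asarray(synth_on_human, dtype=float)
    synth_all = np.asarray(synth_all, dtype=float)

    # Synthetic-only estimate over the full set (biased on its own).
    synth_mean = synth_all.mean()
    # Average human-vs-synthetic discrepancy on the labelled subset;
    # adding it back removes the synthetic judge's bias in expectation.
    correction = (human_labels - synth_on_human).mean()
    return synth_mean + correction

# Toy usage: 50 human-labelled prompts out of 2000 synthetically judged ones.
rng = np.random.default_rng(0)
synth_all = rng.binomial(1, 0.62, size=2000)          # synthetic judge, full set
synth_sub = synth_all[:50]                             # same judge, labelled subset
human_sub = np.where(rng.random(50) < 0.9, synth_sub, 1 - synth_sub)  # humans agree ~90%
print(combined_win_rate(human_sub, synth_sub, synth_all))
```

The variance of such an estimator shrinks as the synthetic judge agrees more closely with humans, which is consistent with the paper's finding that a finetuned synthetic evaluator yields larger annotation savings than an off-the-shelf one.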
Related papers
- Poor-Supervised Evaluation for SuperLLM via Mutual Consistency [20.138831477848615]
We propose the PoEM framework to conduct evaluation without accurate labels.
We first prove that the capability of a model can be equivalently assessed by the consistency between it and a certain reference model.
To alleviate the insufficiencies of the conditions in reality, we introduce an algorithm that treats humans (when available) and the models under evaluation as reference models.
arXiv Detail & Related papers (2024-08-25T06:49:03Z) - Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Models evaluator preferences with human evaluations.
We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer.
Our findings significantly improve aligning the recalibrated LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z) - LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [106.09361690937618]
There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments.
We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data.
We evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations.
arXiv Detail & Related papers (2024-06-26T14:56:13Z) - Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt-alignment.
We show that its superiority over coarse-grained feedback is not automatic.
We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z) - Off-Policy Evaluation from Logged Human Feedback [23.88252045734197]
We study off-policy evaluation from logged human feedback.
We propose both model-based and model-free estimators for policy values.
Our estimators can predict the absolute values of evaluated policies, rank them, and be optimized.
arXiv Detail & Related papers (2024-06-14T13:38:18Z) - Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation [9.452326973655445]
We find that metric-based methods enhance the efficiency of human evaluations by minimizing the number of required annotations.
We show that our method is effective across widely used model families, reducing instances of indecisive (or "tie") outcomes by up to 54%.
This potential reduction in required human effort positions our approach as a valuable strategy in future large language model evaluations.
arXiv Detail & Related papers (2023-10-22T21:48:51Z) - SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation [78.23119125463964]
We develop SocREval, a novel approach for prompt design in reference-free reasoning evaluation.
SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics.
arXiv Detail & Related papers (2023-09-29T18:25:46Z) - Human Feedback is not Gold Standard [28.63384327791185]
We critically analyse the use of human feedback for both training and evaluation.
We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality.
arXiv Detail & Related papers (2023-09-28T11:18:20Z) - Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z) - Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.
arXiv Detail & Related papers (2023-05-01T17:36:06Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)