Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
- URL: http://arxiv.org/abs/2410.13341v1
- Date: Thu, 17 Oct 2024 08:49:42 GMT
- Title: Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
- Authors: Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt
- Abstract summary: An emerging family of debiasing tools promises to fix biases in model-as-judge evaluations by using a few high-quality labels to debias a large number of model judgments.
Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half.
- Score: 14.95829896035971
- Abstract: High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation, and points out promising avenues for future work.
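For intuition, the sketch below (hypothetical, not the paper's code) shows the kind of debiasing estimator the abstract has in mind: a control-variate style correction that combines many cheap judge verdicts with a small random sample of ground-truth labels. The function name, simulation parameters, and judge noise model are all illustrative assumptions.

```python
# A minimal sketch (not the paper's code) of debiasing judge verdicts with a
# few ground-truth labels via a control-variate style correction. The setup
# (N, n, noise rate, seed) is an invented illustration, not data from the paper.
import numpy as np

rng = np.random.default_rng(0)

def debiased_accuracy(judge_all, judge_gold, gold):
    """Estimate the evaluated model's accuracy from N judge verdicts
    plus n ground-truth labels on a random subset.

    judge_all  : 0/1 judge verdicts on all N examples (cheap)
    judge_gold : the judge's verdicts on the n gold-labeled examples
    gold       : true 0/1 correctness labels on those n examples (expensive)
    """
    # Judge-only estimate (biased whenever the judge is systematically wrong) ...
    judge_mean = judge_all.mean()
    # ... corrected by the judge-vs-gold gap measured on the small labeled subset.
    correction = (gold - judge_gold).mean()
    return judge_mean + correction

# --- toy simulation ---------------------------------------------------------
N, n = 10_000, 200                                  # many judge verdicts, few gold labels
true_acc = 0.70                                     # the evaluated model's true accuracy
gold_all = rng.binomial(1, true_acc, N)             # per-example correctness of the model
flip = rng.binomial(1, 0.25, N).astype(bool)        # judge errs on 25% of examples
judge_all = np.where(flip, 1 - gold_all, gold_all)  # judge verdicts

idx = rng.choice(N, size=n, replace=False)          # examples we pay to gold-label
est_debiased = debiased_accuracy(judge_all, judge_all[idx], gold_all[idx])
est_gold_only = gold_all[idx].mean()                # baseline: the n gold labels alone
print(f"debiased: {est_debiased:.3f}  gold-only: {est_gold_only:.3f}  truth: {true_acc}")
```

Because the gold-labeled subset is a uniform random sample, the correction term makes the combined estimate unbiased even when the judge is systematically wrong; the paper's theorem concerns how little such schemes can shrink the gold-label budget once the judge is no more accurate than the model being evaluated.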
Related papers
- Self-rationalization improves LLM as a fine-grained judge [21.917301609125417]
We introduce Self-Rationalization, an iterative process of improving the rationales for the judge models.
Self-rationalization works by having the model generate multiple judgments with rationales for the same input.
We show that our model learns to produce higher quality rationales, with a win rate of 62% on average compared to models trained only via SFT on rationales.
arXiv Detail & Related papers (2024-10-07T21:05:53Z) - Mitigating the Bias of Large Language Model Evaluation [30.67730115141905]
We present a systematic study of the bias of LLM-as-a-Judge.
For closed-source judge models, we apply calibration to reduce the influence of superficial quality.
For open-source judge models, we propose to mitigate the bias by contrastive training, with curated negative samples that deviate from the instruction but exhibit better superficial quality.
arXiv Detail & Related papers (2024-09-25T09:52:44Z) - Direct Judgement Preference Optimization [66.83088028268318]
We train large language models (LLMs) as generative judges to evaluate and critique other models' outputs.
We employ three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective.
Our model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
arXiv Detail & Related papers (2024-09-23T02:08:20Z) - OffsetBias: Leveraging Debiased Data for Tuning Evaluators [1.5790747258969664]
We qualitatively identify six types of biases inherent in various judge models.
Fine-tuning on our dataset significantly enhances the robustness of judge models against biases.
arXiv Detail & Related papers (2024-07-09T05:16:22Z) - Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges [6.609843448260634]
The LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models.
This paper focuses on a clean scenario in which inter-human agreement is high.
We identify vulnerabilities in judge models, such as their sensitivity to prompt complexity and length, and a tendency toward leniency.
arXiv Detail & Related papers (2024-06-18T13:49:54Z) - Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models [56.02275285521847]
We propose to evaluate models using a Panel of LLM evaluators (PoLL).
We find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive. A hypothetical sketch of such panel-based voting appears after this list.
arXiv Detail & Related papers (2024-04-29T15:33:23Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate, for example, that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Fast Model Debias with Machine Unlearning [54.32026474971696]
Deep neural networks might behave in a biased manner in many real-world scenarios.
Existing debiasing methods suffer from high costs in bias labeling or model re-training.
We propose a fast model debiasing framework (FMD) which offers an efficient approach to identify, evaluate and remove biases.
arXiv Detail & Related papers (2023-10-19T08:10:57Z) - Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels [61.97359362447732]
Learning from noisy labels is an important and long-standing problem in machine learning for real applications.
In this paper, we reformulate the label-noise problem from a generative-model perspective.
Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets.
arXiv Detail & Related papers (2023-05-31T03:01:36Z) - Spuriosity Rankings: Sorting Data to Measure and Mitigate Biases [62.54519787811138]
We present a simple but effective method to measure and mitigate model biases caused by reliance on spurious cues.
We rank images within their classes based on spuriosity, proxied via deep neural features of an interpretable network.
Our results suggest that model bias due to spurious feature reliance is influenced far more by what the model is trained on than how it is trained.
arXiv Detail & Related papers (2022-12-05T23:15:43Z)
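The "panel of judges" idea in the PoLL entry above can be pictured with a short, hypothetical sketch: several judge models each emit a verdict and the panel reports the majority vote. The Judge interface, the stand-in judges, and the tie handling are assumptions for illustration, not any paper's actual API.

```python
# A minimal, hypothetical sketch of pooling verdicts from a panel of judge
# models by majority vote, in the spirit of the PoLL entry above.
from collections import Counter
from typing import Callable, Sequence

# A judge maps (question, candidate answer) to a verdict such as "good" / "bad".
Judge = Callable[[str, str], str]

def panel_verdict(judges: Sequence[Judge], question: str, answer: str) -> str:
    """Return the most common verdict across the panel (ties fall to the verdict seen first)."""
    votes = Counter(judge(question, answer) for judge in judges)
    return votes.most_common(1)[0][0]

# Usage with toy stand-in judges (real ones would query different LLMs).
if __name__ == "__main__":
    lenient = lambda q, a: "good"
    strict = lambda q, a: "good" if len(a) > 20 else "bad"
    literal = lambda q, a: "good" if q.split()[-1].rstrip("?") in a else "bad"
    print(panel_verdict([lenient, strict, literal],
                        "What is the capital of France?", "Paris"))
```

Averaging over disjoint model families in this way is what the PoLL entry credits with reducing intra-model bias relative to a single large judge.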
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.