Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
- URL: http://arxiv.org/abs/2404.18796v2
- Date: Wed, 1 May 2024 15:37:11 GMT
- Title: Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
- Authors: Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis
- Abstract summary: We propose to evaluate models using a Panel of LLM evaluators (PoLL).
We find that a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias because it draws on disjoint model families, and is over seven times less expensive.
- Score: 56.02275285521847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) have become more advanced, they have outpaced our ability to accurately evaluate their quality. Not only is it difficult to find data that adequately probes particular model properties, but evaluating the correctness of a model's free-form generation is a challenge in itself. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT-4. While this method has grown in popularity, it is costly, has been shown to introduce intra-model bias, and in this work, we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLM evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
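Read concretely, the PoLL recipe is: send the same grading task to a panel of small judges drawn from different model families, collect their verdicts, and pool them. A minimal sketch, assuming binary correct/incorrect verdicts pooled by majority vote (one natural pooling rule, not necessarily the paper's exact aggregation) and toy stand-in judges in place of real LLM API calls:

```python
from collections import Counter
from typing import Callable

# A judge is any callable mapping (model_output, reference) to a binary
# verdict. Real PoLL judges would be small LLMs from disjoint model
# families queried with a grading prompt; the stand-ins below are toys.
Judge = Callable[[str, str], int]

def poll_verdict(judges: list[Judge], output: str, reference: str) -> int:
    """Pool the panel's binary verdicts by majority vote."""
    votes = [judge(output, reference) for judge in judges]
    return Counter(votes).most_common(1)[0][0]

# Toy stand-in judges (assumptions for illustration, not from the paper):
exact_match: Judge = lambda out, ref: int(out.strip() == ref.strip())
contains_ref: Judge = lambda out, ref: int(ref.lower() in out.lower())
half_overlap: Judge = lambda out, ref: int(
    2 * len(set(out.lower().split()) & set(ref.lower().split()))
    >= len(ref.split())
)

if __name__ == "__main__":
    panel = [exact_match, contains_ref, half_overlap]
    print(poll_verdict(panel, "It is Paris", "Paris"))  # -> 1 (2 of 3 accept)
```

Because each judge call is cheap, a panel of several small models can stay well under the cost of a single large judge while diluting any one model family's stylistic bias.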
Related papers
- Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods.
In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
- Bias Similarity Across Large Language Models [32.0365189539138]
We analyze bias through output distributions across multiple dimensions using two datasets (4K and 1M questions).
Our results show that fine-tuning has minimal impact on output distributions and that proprietary models tend to over-respond with "unknown" to minimize bias, compromising accuracy and utility.
Open-source models like Llama3-Chat and Gemma2-it demonstrate fairness comparable to proprietary models like GPT-4, challenging the assumption that larger, closed-source models are inherently less biased.
arXiv Detail & Related papers (2024-10-15T19:21:14Z)
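The entry above compares models by the distributions of answers they produce. A minimal sketch of that idea under assumed details (answer-frequency distributions compared with total variation distance; the paper's actual dimensions and metrics may differ):

```python
from collections import Counter

def answer_distribution(answers: list[str]) -> dict[str, float]:
    """Normalized frequency of each answer a model produced."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two answer distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Toy comparison: one model hedges with "unknown" far more than the other,
# mirroring the over-refusal behavior the entry attributes to proprietary models.
model_a = answer_distribution(["yes", "no", "unknown", "unknown", "unknown"])
model_b = answer_distribution(["yes", "no", "yes", "no", "unknown"])
print(total_variation(model_a, model_b))  # ~0.4
```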
- What Matters for Model Merging at Scale? [94.26607564817786]
Model merging aims to combine multiple expert models into a more capable single model.
Previous studies have primarily focused on merging a few small models.
This study systematically evaluates the utility of model merging at scale.
arXiv Detail & Related papers (2024-10-04T17:17:19Z)
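The simplest concrete instance of merging is uniform parameter averaging over same-architecture experts ("model soup" style). Whether the study above evaluates exactly this recipe is not stated in the summary, so treat this as an assumed baseline sketch:

```python
import torch

def merge_uniform(state_dicts: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Uniformly average the parameters of several same-architecture experts."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack([sd[name] for sd in state_dicts]).mean(dim=0)
    return merged

# Usage: load N checkpoints fine-tuned from the same base model, then
#   merged = merge_uniform([expert.state_dict() for expert in experts])
#   base_model.load_state_dict(merged)
```

Weighted or task-vector variants replace the uniform mean with learned or heuristic coefficients; how such recipes behave as expert count and model size grow is the question the study examines.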
- MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? [59.7772329962047]
We introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges.
Specifically, we evaluate a large variety of multimodal judges, including smaller CLIP-based scoring models, open-source VLMs, and closed-source VLMs.
Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming the other judges on average.
arXiv Detail & Related papers (2024-07-05T20:03:16Z)
- Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks [3.58262772907022]
We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other's responses to produce a ranking in a democratic fashion.
In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts.
Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge.
arXiv Detail & Related papers (2024-06-12T19:05:43Z)
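The council summary above says rankings are produced "in a democratic fashion" without naming the aggregation rule; a Borda-style positional vote is one standard choice, sketched here purely as an assumption:

```python
from collections import defaultdict

def borda_aggregate(rankings: list[list[str]]) -> list[str]:
    """Combine each judge's ranking into one consensus ordering.

    Each judge awards points by position: first place in a ranking of
    n models earns n-1 points, last place earns 0.
    """
    points: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, model in enumerate(ranking):
            points[model] += n - 1 - position
    return sorted(points, key=points.get, reverse=True)

# Three judges rank three models; B wins on aggregate despite no unanimity.
judges = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]
print(borda_aggregate(judges))  # ['B', 'A', 'C']
```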
- Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers [66.36045164286854]
We analyze a set of existing bias features and demonstrate that there is no single model that works best for all cases.
By choosing an appropriate bias model, we can obtain better robustness results than baselines with a more sophisticated model design.
arXiv Detail & Related papers (2022-10-28T17:52:10Z)
- Evaluation of HTR models without Ground Truth Material [2.4792948967354236]
Evaluation of Handwritten Text Recognition models during their development is straightforward.
But the evaluation process becomes tricky as soon as we switch from development to application.
We show that lexicon-based evaluation can compete with lexicon-free methods.
arXiv Detail & Related papers (2022-01-17T01:26:09Z)
- When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles can outperform single models, achieving both higher accuracy and fewer total FLOPs to compute.
This suggests that exploiting output diversity through ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
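The efficiency claim above rests on running several small models and combining their outputs rather than one large model. A minimal probability-averaging sketch (the logit shapes and member count are illustrative assumptions):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def ensemble_predict(member_logits: list[np.ndarray]) -> np.ndarray:
    """Average class probabilities across member models, then take argmax.

    If each small member costs c FLOPs and the panel has k members, the
    ensemble costs roughly k*c, which can still undercut one much larger model.
    """
    probs = np.mean([softmax(l) for l in member_logits], axis=0)
    return probs.argmax(axis=-1)

# Toy usage: three small models' logits for a batch of two 5-class examples.
members = [np.random.randn(2, 5) for _ in range(3)]
print(ensemble_predict(members))  # one predicted class per example
```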