Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments
- URL: http://arxiv.org/abs/2504.17087v1
- Date: Wed, 23 Apr 2025 20:32:12 GMT
- Title: Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments
- Authors: Yuran Li, Jama Hussein Mohamud, Chongren Sun, Di Wu, Benoit Boulet,
- Abstract summary: Large language models (LLMs) are being widely applied across various fields, but as tasks become more complex, evaluating their responses is increasingly challenging.<n>We propose a three-stage meta-judge selection pipeline: 1) developing a comprehensive rubric with GPT-4 and human experts, 2) using three advanced LLM agents to score judgments, and 3) applying a threshold to filter out low-scoring judgments.<n> Experimental results on the JudgeBench dataset show about 15.55% improvement compared to raw judgments and about 8.37% improvement over the single-agent baseline.
- Score: 6.270885758858811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are being widely applied across various fields, but as tasks become more complex, evaluating their responses is increasingly challenging. Compared to human evaluators, the use of LLMs to support performance evaluation offers a more efficient alternative. However, most studies focus mainly on aligning LLMs' judgments with human preferences, overlooking the existence of biases and mistakes in human judgment. Furthermore, how to select suitable LLM judgments given multiple potential LLM responses remains underexplored. To address these two aforementioned issues, we propose a three-stage meta-judge selection pipeline: 1) developing a comprehensive rubric with GPT-4 and human experts, 2) using three advanced LLM agents to score judgments, and 3) applying a threshold to filter out low-scoring judgments. Compared to methods using a single LLM as both judge and meta-judge, our pipeline introduces multi-agent collaboration and a more comprehensive rubric. Experimental results on the JudgeBench dataset show about 15.55\% improvement compared to raw judgments and about 8.37\% improvement over the single-agent baseline. Our work demonstrates the potential of LLMs as meta-judges and lays the foundation for future research on constructing preference datasets for LLM-as-a-judge reinforcement learning.
Related papers
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods [21.601196380989542]
''LLMs-as-judges'' are evaluators based on natural language responses.<n>This paper presents a comprehensive survey of the ''LLMs-as-judges'' paradigm from five key perspectives.<n>We aim to provide insights on the development and application of ''LLMs-as-judges'' in both research and practice.
arXiv Detail & Related papers (2024-12-07T08:07:24Z) - CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
arXiv Detail & Related papers (2024-07-15T07:43:55Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.<n>The question of how reliable these evaluators are has emerged as a crucial research question.<n>We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Large Language Models are Inconsistent and Biased Evaluators [2.136983452580014]
We show that Large Language Models (LLMs) are biased evaluators as they exhibit familiarity bias and show skewed distributions of ratings.
We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality.
arXiv Detail & Related papers (2024-05-02T20:42:28Z) - MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark [41.68821233828375]
This paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities.
Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking.
arXiv Detail & Related papers (2024-02-07T12:28:32Z) - PiCO: Peer Review in LLMs based on the Consistency Optimization [48.48819141999387]
We use peer-review mechanisms to measure large language models (LLMs) automatically.<n>We formalize it as a constrained optimization problem, intending to maximize the consistency of each LLM's capabilities and scores.<n>We propose three metrics called PEN, CIN, and LIS to evaluate the gap in aligning human rankings.
arXiv Detail & Related papers (2024-02-02T18:49:26Z) - Can Large Language Models be Trusted for Evaluation? Scalable
Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z) - PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.