Can Large Language Models be Trusted for Evaluation? Scalable
Meta-Evaluation of LLMs as Evaluators via Agent Debate
- URL: http://arxiv.org/abs/2401.16788v1
- Date: Tue, 30 Jan 2024 07:03:32 GMT
- Title: Can Large Language Models be Trusted for Evaluation? Scalable
Meta-Evaluation of LLMs as Evaluators via Agent Debate
- Authors: Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu
- Abstract summary: We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
- Score: 74.06294042304415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the utility of Large Language Models (LLMs) across a wide range of
tasks and scenarios, developing a method for reliably evaluating LLMs across
varied contexts continues to be challenging. Modern evaluation approaches often
use LLMs to assess responses generated by LLMs. However, the meta-evaluation
conducted to assess the effectiveness of these LLMs as evaluators is typically
constrained by the coverage of existing benchmarks or requires extensive human
annotation. This underscores the urgency of methods for scalable
meta-evaluation that can effectively, reliably, and efficiently evaluate the
performance of LLMs as evaluators across diverse tasks and scenarios,
particularly in potentially new, user-defined scenarios. To fill this gap, we
propose ScaleEval, an agent-debate-assisted meta-evaluation framework that
leverages the capabilities of multiple communicative LLM agents. This framework
supports multi-round discussions to assist human annotators in discerning the
most capable LLMs as evaluators, which significantly eases their workload in
cases that used to require large-scale annotations during meta-evaluation. We
release the code for our framework, which is publicly available at:
https://github.com/GAIR-NLP/scaleeval.
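The abstract describes the mechanism only at a high level, so below is a minimal, hypothetical sketch of the general agent-debate pattern it outlines: several communicative LLM agents discuss two candidate responses over multiple rounds, and a case is escalated to a human annotator only when the agents fail to reach consensus. The names here (DebateAgent, debate_verdict, the llm callable) are assumptions made for illustration and are not the actual ScaleEval API; the released implementation is at https://github.com/GAIR-NLP/scaleeval.

```python
# Hypothetical sketch of a multi-round agent-debate loop for meta-evaluation.
# Not the ScaleEval API; names and prompts are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable, List, Optional

# Backend-agnostic stub: a callable that maps a prompt string to the agent's reply.
LLMFn = Callable[[str], str]

@dataclass
class DebateAgent:
    name: str
    llm: LLMFn  # wrap any chat/completion backend as prompt -> text

    def argue(self, task: str, resp_a: str, resp_b: str, history: str) -> str:
        """Ask this agent which response better satisfies the task, given the debate so far."""
        prompt = (
            f"Task: {task}\nResponse A: {resp_a}\nResponse B: {resp_b}\n"
            f"Debate so far:\n{history or '(none)'}\n"
            "Answer with 'A' or 'B' first, then a brief justification."
        )
        return self.llm(prompt)

def debate_verdict(agents: List[DebateAgent], task: str, resp_a: str, resp_b: str,
                   max_rounds: int = 3) -> Optional[str]:
    """Return 'A' or 'B' if all agents agree within max_rounds, else None (defer to a human)."""
    history = ""
    for _ in range(max_rounds):
        votes = []
        for agent in agents:
            reply = agent.argue(task, resp_a, resp_b, history)
            history += f"\n[{agent.name}] {reply}"
            # Simplification: treat any reply not starting with 'B' as a vote for A.
            votes.append("B" if reply.strip().upper().startswith("B") else "A")
        if len(set(votes)) == 1:  # consensus reached: no human annotation needed here
            return votes[0]
    return None  # no consensus: escalate this case to a human annotator
```

In actual use, each llm callable would wrap a real chat-completion backend, and cases returning None would be routed to the (much smaller) pool shown to human annotators, which is the workload reduction the abstract refers to.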
Related papers
- CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
arXiv Detail & Related papers (2024-07-15T07:43:55Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- State of What Art? A Call for Multi-Prompt LLM Evaluation [28.307860675006545]
We comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances.
To improve the robustness of the analysis, we propose evaluating LLMs with a set of diverse prompts instead.
arXiv Detail & Related papers (2023-12-31T22:21:36Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating queries without considering user experience.
We propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria, using a potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation [16.73300162869746]
Large Language Models (LLMs) have made progress in various real-world tasks.
Existing evaluation methods are mainly based on supervised signals.
We propose a novel Deep Interaction-based LLM-evaluation framework.
arXiv Detail & Related papers (2023-09-08T15:00:41Z)
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect the corresponding ability, while new tasks can easily be added to the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.