LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain
Conversations with Large Language Models
- URL: http://arxiv.org/abs/2305.13711v1
- Date: Tue, 23 May 2023 05:57:09 GMT
- Title: LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain
Conversations with Large Language Models
- Authors: Yen-Ting Lin, Yun-Nung Chen
- Abstract summary: We propose a unified multi-dimensional automatic evaluation method for open-domain conversations with large language models (LLMs)
We design a single prompt-based evaluation method that leverages a unified evaluation schema to cover multiple dimensions of conversation quality in a single model call.
We extensively evaluate the performance of LLM-Eval on various benchmark datasets, demonstrating its effectiveness, efficiency, and adaptability compared to state-of-the-art evaluation methods.
- Score: 28.441725610692714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose LLM-Eval, a unified multi-dimensional automatic evaluation method
for open-domain conversations with large language models (LLMs). Existing
evaluation methods often rely on human annotations, ground-truth responses, or
multiple LLM prompts, which can be expensive and time-consuming. To address
these issues, we design a single prompt-based evaluation method that leverages
a unified evaluation schema to cover multiple dimensions of conversation
quality in a single model call. We extensively evaluate the performance of
LLM-Eval on various benchmark datasets, demonstrating its effectiveness,
efficiency, and adaptability compared to state-of-the-art evaluation methods.
Our analysis also highlights the importance of choosing suitable LLMs and
decoding strategies for accurate evaluation results. LLM-Eval offers a
versatile and robust solution for evaluating open-domain conversation systems,
streamlining the evaluation process and providing consistent performance across
diverse scenarios.
Related papers
- AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation.
We show AIME outperforming baseline methods in code generation tasks, with up to $62%$ higher error detection rate and up to $16%$ higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets.
arXiv Detail & Related papers (2024-10-04T04:03:24Z) - Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - Can Large Language Models be Trusted for Evaluation? Scalable
Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z) - State of What Art? A Call for Multi-Prompt LLM Evaluation [28.307860675006545]
We comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances.
To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead.
arXiv Detail & Related papers (2023-12-31T22:21:36Z) - A Comprehensive Analysis of the Effectiveness of Large Language Models
as Automatic Dialogue Evaluators [46.939611070781794]
Large language models (LLMs) are shown to be promising substitutes for human judges.
We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels.
We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
arXiv Detail & Related papers (2023-12-24T04:50:57Z) - MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited in evaluating queries without considering user experiences.
We propose a new evaluation paradigm for MLLMs, which is evaluating MLLMs with per-sample criteria using potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z) - Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation [16.73300162869746]
Large Language Models (LLMs) have made progress in various real-world tasks.
Existing evaluation methods are mainly supervised signal-based.
We propose a novel Deep Interaction-based LLM-evaluation framework.
arXiv Detail & Related papers (2023-09-08T15:00:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.