Can Large Language Models Be an Alternative to Human Evaluations?
- URL: http://arxiv.org/abs/2305.01937v1
- Date: Wed, 3 May 2023 07:28:50 GMT
- Title: Can Large Language Models Be an Alternative to Human Evaluations?
- Authors: Cheng-Han Chiang and Hung-yi Lee
- Abstract summary: Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation.
- Score: 80.81532239566992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human evaluation is indispensable and inevitable for assessing the quality of
texts generated by machine learning models or written by humans. However, human
evaluation is very difficult to reproduce and its quality is notoriously
unstable, hindering fair comparisons among different natural language
processing (NLP) models and algorithms. Recently, large language models (LLMs)
have demonstrated exceptional performance on unseen tasks when only the task
instructions are provided. In this paper, we explore whether this ability of
LLMs can serve as an alternative to human evaluation. We present the LLMs
with the exact same instructions, samples to be evaluated, and questions used
to conduct human evaluation, and then ask the LLMs to generate responses to
those questions; we dub this LLM evaluation. We use human evaluation and LLM
evaluation to evaluate the texts in two NLP tasks: open-ended story generation
and adversarial attacks. We show that the results of LLM evaluation are
consistent with those obtained by expert human evaluation: the texts rated
higher by human experts are also rated higher by the LLMs. We also find that
the results of LLM evaluation are stable across different formattings of the
task instructions and different sampling algorithms used to generate the answers. We
are the first to show the potential of using LLMs to assess the quality of
texts and discuss the limitations and ethical considerations of LLM evaluation.
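The LLM-evaluation protocol described in the abstract (present the model with the same instructions, sample, and rating question used in human evaluation, then parse its answer) can be illustrated with a minimal sketch. The function name, prompt template, rating scale, and the generic `complete` callable below are illustrative assumptions, not the paper's exact setup.

```python
import re
from typing import Callable, Optional

def llm_evaluate(
    complete: Callable[[str], str],   # any text-generation backend: prompt in, reply out (assumed interface)
    task_instructions: str,           # the same instructions shown to human evaluators
    sample_text: str,                 # the text whose quality is being judged
    question: str,                    # the same rating question used in the human questionnaire
    scale: tuple = (1, 5),            # hypothetical Likert range; the actual scale is task-specific
) -> Optional[int]:
    """Ask an LLM the question a human evaluator would answer and parse its rating."""
    lo, hi = scale
    prompt = (
        f"{task_instructions}\n\n"
        f"Text to evaluate:\n{sample_text}\n\n"
        f"{question}\n"
        f"Answer with a single integer from {lo} to {hi}."
    )
    reply = complete(prompt)
    match = re.search(r"\d+", reply)  # take the first integer the model produces
    if match is None:
        return None                   # no usable rating in the reply
    rating = int(match.group())
    return rating if lo <= rating <= hi else None

# Usage with a stub backend; in practice `complete` would wrap an actual LLM call.
if __name__ == "__main__":
    score = llm_evaluate(
        complete=lambda prompt: "4",  # stub model that always answers 4
        task_instructions="Please read the story fragment and rate its grammaticality.",
        sample_text="Once upon a time, a robot learned to write stories.",
        question="How grammatically correct is the text? (1 = not at all, 5 = very)",
    )
    print(score)  # -> 4
```

Averaging such ratings over many samples per system, and checking whether systems ranked higher by human experts are also ranked higher by the LLM, is the kind of consistency comparison the paper reports.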
Related papers
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition [46.949604465227054]
We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition.
MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs.
The pairwise comparison results are then aggregated into a global ranking using the Elo rating system (a minimal Elo sketch appears after this list).
arXiv Detail & Related papers (2024-04-10T01:26:24Z)
- PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs [35.717370285231176]
Large language models (LLMs) have shown impressive capabilities across various natural language tasks.
We propose a comprehensive human evaluation framework to assess LLMs' proficiency in following instructions on diverse real-world tasks.
arXiv Detail & Related papers (2023-11-09T13:58:59Z)
- Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose a Collaborative Evaluation pipeline CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning [7.457517083017178]
Large language models (LLMs) are used for evaluation of text generated by humans and AI alike.
Despite their utility, LLMs exhibit distinct failure modes, necessitating a thorough audit and improvement of their text evaluation capabilities.
Here we introduce ALLURE, a systematic approach to Auditing Large Language Models' Understanding and Reasoning Errors.
arXiv Detail & Related papers (2023-09-24T17:15:58Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find that instruction tuning, rather than model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
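The Elo aggregation mentioned in the Maximum Discrepancy (MAD) competition entry above can be sketched as follows. This is a generic Elo update with conventional constants (k = 32, initial rating 1000), not the paper's exact configuration, and the model names in the example are hypothetical.

```python
from collections import defaultdict

def elo_ranking(pairwise_results, k=32.0, initial=1000.0):
    """Aggregate pairwise comparison outcomes into a global Elo ranking.

    pairwise_results: iterable of (model_a, model_b, score_a), where score_a is
    1.0 if model_a won the comparison, 0.0 if it lost, and 0.5 for a tie.
    k and initial are conventional Elo constants, not values from the paper.
    """
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in pairwise_results:
        r_a, r_b = ratings[a], ratings[b]
        expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))  # predicted win probability of a
        ratings[a] = r_a + k * (score_a - expected_a)
        ratings[b] = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)

# Hypothetical comparison outcomes between three LLMs on a few selected instructions.
results = [("llm_a", "llm_b", 1.0), ("llm_b", "llm_c", 0.5), ("llm_a", "llm_c", 1.0)]
print(elo_ranking(results))  # highest-rated model first
```

Each pairwise outcome nudges the two models' ratings toward or away from each other in proportion to how surprising the result was, so a small set of informative comparisons can still yield a stable global ordering.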