DHP Benchmark: Are LLMs Good NLG Evaluators?
- URL: http://arxiv.org/abs/2408.13704v1
- Date: Sun, 25 Aug 2024 02:01:38 GMT
- Title: DHP Benchmark: Are LLMs Good NLG Evaluators?
- Authors: Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, Xia Hu,
- Abstract summary: Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks.
We propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework to measure the NLG evaluation capabilities of LLMs.
We have re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation.
- Score: 42.16315294351651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks. However, the capabilities of LLMs in scoring NLG quality remain inadequately explored. Current studies depend on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which provides quantitative discernment scores for LLMs utilizing hierarchically perturbed text data and statistical tests to measure the NLG evaluation capabilities of LLMs systematically. We have re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Our comprehensive benchmarking of five major LLM series provides critical insight into their strengths and limitations as NLG evaluators.
Related papers
- Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z) - Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks.
We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z) - Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - METAL: Towards Multilingual Meta-Evaluation [12.852595634767901]
This study proposes a framework for an end-to-end assessment of Large Language Models (LLMs) as evaluators in multilingual scenarios.
We create a dataset covering 10 languages containing native speaker judgments for the task of summarization.
We compare the performance of LLM-based evaluators created using GPT-3.5-Turbo, GPT-4, and PaLM2.
arXiv Detail & Related papers (2024-04-02T06:14:54Z) - LLM-based NLG Evaluation: Current Status and Challenges [41.69249290537395]
evaluating natural language generation (NLG) is a vital but challenging problem in artificial intelligence.
Large language models (LLMs) have demonstrated great potential in NLG evaluation in recent years.
Various automatic evaluation methods based on LLMs have been proposed.
arXiv Detail & Related papers (2024-02-02T13:06:35Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - Through the Lens of Core Competency: Survey on Evaluation of Large
Language Models [27.271533306818732]
Large language model (LLM) has excellent performance and wide practical uses.
Existing evaluation tasks are difficult to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z) - A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.