LLM-based NLG Evaluation: Current Status and Challenges
- URL: http://arxiv.org/abs/2402.01383v2
- Date: Mon, 26 Feb 2024 14:55:48 GMT
- Title: LLM-based NLG Evaluation: Current Status and Challenges
- Authors: Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, Xiaojun Wan
- Abstract summary: Evaluating natural language generation (NLG) is a vital but challenging problem in artificial intelligence.
Large language models (LLMs) have demonstrated great potential in NLG evaluation in recent years.
Various automatic evaluation methods based on LLMs have been proposed.
- Score: 41.69249290537395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating natural language generation (NLG) is a vital but challenging
problem in artificial intelligence. Traditional evaluation metrics mainly
capturing content (e.g. n-gram) overlap between system outputs and references
are far from satisfactory, and large language models (LLMs) such as ChatGPT
have demonstrated great potential in NLG evaluation in recent years. Various
automatic evaluation methods based on LLMs have been proposed, including
metrics derived from LLMs, prompting LLMs, and fine-tuning LLMs with labeled
evaluation data. In this survey, we first give a taxonomy of LLM-based NLG
evaluation methods, and discuss their pros and cons, respectively. We also
discuss human-LLM collaboration for NLG evaluation. Lastly, we discuss several
open problems in this area and point out future research directions.
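As a rough illustration of the "prompting LLMs" category in this taxonomy, the Python sketch below asks an LLM to rate a single generated summary on one quality dimension and parses the integer it returns. The prompt wording and the `call_llm` client are illustrative placeholders, not a method defined in the survey.

```python
# Minimal sketch of prompting-based NLG evaluation: the LLM is asked to rate
# one system output on a single quality dimension. `call_llm` is a hypothetical
# placeholder for whatever chat-completion client is actually used.
from typing import Callable

PROMPT_TEMPLATE = """You will be given a source document and a system-generated summary.
Rate the coherence of the summary on a scale from 1 (incoherent) to 5 (highly coherent).
Reply with a single integer.

Source:
{source}

Summary:
{summary}

Coherence score:"""


def prompt_based_score(source: str, summary: str,
                       call_llm: Callable[[str], str]) -> int:
    """Score one output by prompting an LLM and parsing the integer it replies with."""
    prompt = PROMPT_TEMPLATE.format(source=source, summary=summary)
    reply = call_llm(prompt)
    # Keep the first integer found in the reply; fall back to the scale midpoint.
    for token in reply.split():
        token = token.strip(".,")
        if token.isdigit():
            return int(token)
    return 3
```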
Related papers
- DHP Benchmark: Are LLMs Good NLG Evaluators? [42.16315294351651]
Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks.
We propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework to measure the NLG evaluation capabilities of LLMs.
We have re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation.
arXiv Detail & Related papers (2024-08-25T02:01:38Z)
- Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models [7.529095331830944]
Current benchmarks for evaluating large language models (LLMs) suffer from issues such as restricted evaluation content, untimely updates, and a lack of optimization guidance.
We propose a new paradigm for the measurement of LLMs: Benchmarking-Evaluation-Assessment.
arXiv Detail & Related papers (2024-07-10T10:42:02Z)
- Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability [39.12792986841385]
In this paper, we construct NLG-Eval, a large-scale NLG evaluation corpus with annotations from both humans and GPT-4.
We also propose Themis, an LLM dedicated to NLG evaluation, trained with our multi-perspective consistency verification and rating-oriented preference alignment methods.
Themis exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
arXiv Detail & Related papers (2024-06-26T14:04:29Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin (see the correlation sketch after this list).
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
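Several of the entries above measure agreement with human judgments as a rank correlation (e.g., G-Eval's reported Spearman correlation of 0.514 on summarization). The sketch below shows how such a correlation is typically computed with SciPy; the score lists are made-up placeholders, not data from any of these papers.

```python
# Meta-evaluation sketch: Spearman correlation between LLM-assigned scores and
# human judgments of the same outputs. Both score lists are hypothetical.
from scipy.stats import spearmanr

human_scores = [4, 2, 5, 3, 1, 4, 3, 5]  # hypothetical human ratings
llm_scores = [5, 2, 4, 3, 2, 4, 3, 5]    # hypothetical LLM ratings of the same outputs

rho, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3f})")
```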
This list is automatically generated from the titles and abstracts of the papers on this site.