A Survey of Evaluation Metrics Used for NLG Systems
- URL: http://arxiv.org/abs/2008.12009v2
- Date: Mon, 5 Oct 2020 17:38:22 GMT
- Title: A Survey of Evaluation Metrics Used for NLG Systems
- Authors: Ananya B. Sai, Akash Kumar Mohankumar, Mitesh M. Khapra
- Abstract summary: The success of Deep Learning has created a surge of interest in a wide range of Natural Language Generation (NLG) tasks.
Unlike classification tasks, automatically evaluating NLG systems is itself a major challenge.
The expanding number of NLG models and the shortcomings of current metrics have led to a rapid surge in the number of evaluation metrics proposed since 2014.
- Score: 19.20118684502313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of Deep Learning has created a surge of interest in a wide range of Natural Language Generation (NLG) tasks. Deep Learning has not only pushed the state of the art in several existing NLG tasks but has also enabled researchers to explore newer NLG tasks such as image captioning. Such rapid progress in NLG has necessitated the development of accurate automatic evaluation metrics that allow us to track progress in the field. However, unlike classification tasks, automatically evaluating NLG systems is itself a major challenge. Several works have shown that early heuristic-based metrics such as BLEU and ROUGE are inadequate for capturing the nuances of the different NLG tasks. The expanding number of NLG models and the shortcomings of the current metrics have led to a rapid surge in the number of evaluation metrics proposed since 2014. Moreover, evaluation metrics have increasingly shifted from pre-determined heuristic-based formulae to trained transformer models. This rapid change in a relatively short time has created the need for a survey of existing NLG metrics to help both existing and new researchers quickly come up to speed with the developments in NLG evaluation over the last few years. Through this survey, we first highlight the challenges and difficulties in automatically evaluating NLG systems. We then provide a coherent taxonomy of the evaluation metrics to organize the existing metrics and to better understand the developments in the field. We also describe the different metrics in detail and highlight their key contributions. Next, we discuss the main shortcomings identified in the existing metrics and describe the methodology used to evaluate evaluation metrics. Finally, we present our suggestions and recommendations on the next steps towards improving automatic evaluation metrics.
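The methodology the abstract refers to for evaluating evaluation metrics is, in essence, measuring how well a metric's scores correlate with human judgments. The snippet below is a minimal illustrative sketch of that meta-evaluation loop, not code from the paper: it scores hypothetical system outputs with sentence-level BLEU (via NLTK) and computes the Spearman correlation of those scores with made-up human ratings (via SciPy).

```python
# Minimal sketch of metric meta-evaluation: correlate automatic metric scores
# with human judgments. All data below are hypothetical placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# (system output, reference text, human quality rating) -- illustrative only.
examples = [
    ("the cat sat on the mat", "a cat is sitting on the mat", 4.0),
    ("a dog runs in the park", "a dog is running through the park", 3.5),
    ("colorless green ideas sleep furiously", "the weather is nice today", 1.0),
]

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
metric_scores, human_scores = [], []
for hypothesis, reference, rating in examples:
    # sentence_bleu takes a list of tokenized references and a tokenized hypothesis.
    bleu = sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=smooth)
    metric_scores.append(bleu)
    human_scores.append(rating)

# Spearman rank correlation between metric and human scores is the quantity
# usually reported when judging how good an evaluation metric is.
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.3f} (p = {p_value:.3f})")
```

The same loop applies to any metric, heuristic or learned: swap the BLEU call for the metric under study and report the resulting correlation with human judgments.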
Related papers
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z) - LLM-based NLG Evaluation: Current Status and Challenges [41.69249290537395]
Evaluating natural language generation (NLG) is a vital but challenging problem in artificial intelligence.
Large language models (LLMs) have demonstrated great potential in NLG evaluation in recent years.
Various automatic evaluation methods based on LLMs have been proposed.
arXiv Detail & Related papers (2024-02-02T13:06:35Z) - Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z) - Towards a Unified Multi-Dimensional Evaluator for Text Generation [101.47008809623202]
We propose UniEval, a unified multi-dimensional evaluator for Natural Language Generation (NLG).
We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions.
Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
arXiv Detail & Related papers (2022-10-13T17:17:03Z) - Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation
Practices for Generated Text [23.119724118572538]
Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted.
This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG.
arXiv Detail & Related papers (2022-02-14T18:51:07Z) - The GEM Benchmark: Natural Language Generation, its Evaluation and
Metrics [66.96150429230035]
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics.
Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models.
arXiv Detail & Related papers (2021-02-02T18:42:05Z) - Evaluation of Text Generation: A Survey [107.62760642328455]
The paper surveys evaluation methods of natural language generation systems that have been developed in the last few years.
We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics.
arXiv Detail & Related papers (2020-06-26T04:52:48Z)