Reference-free Evaluation Metrics for Text Generation: A Survey
- URL: http://arxiv.org/abs/2501.12011v1
- Date: Tue, 21 Jan 2025 10:05:48 GMT
- Title: Reference-free Evaluation Metrics for Text Generation: A Survey
- Authors: Takumi Ito, Kees van Deemter, Jun Suzuki
- Abstract summary: A number of automatic evaluation metrics have been proposed for natural language generation systems.
The most common approach to automatic evaluation is the use of a reference-based metric that compares the model's output with gold-standard references written by humans.
Various reference-free metrics have been developed in recent years.
- Score: 18.512882012973005
- Abstract: A number of automatic evaluation metrics have been proposed for natural language generation systems. The most common approach to automatic evaluation is the use of a reference-based metric that compares the model's output with gold-standard references written by humans. However, it is expensive to create such references, and for some tasks, such as response generation in dialogue, creating references is not a simple matter. Therefore, various reference-free metrics have been developed in recent years. In this survey, which intends to cover the full breadth of all NLG tasks, we investigate the most commonly used approaches, their application, and their other uses beyond evaluating models. The survey concludes by highlighting some promising directions for future research.
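To make the reference-based/reference-free distinction concrete, here is a minimal, illustrative sketch (not taken from the survey): a toy reference-based overlap score, which needs a human-written reference, versus a toy reference-free score that only consults a language model. The token-F1 and unigram log-probability functions are simplified stand-ins for real metrics such as BLEU or learned LM-based scorers.

```python
from collections import Counter
import math

def reference_based_f1(hypothesis: str, reference: str) -> float:
    """Token-level F1 overlap: a toy stand-in for reference-based metrics
    (e.g. BLEU/ROUGE), which require a gold-standard human reference."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def reference_free_logprob(hypothesis: str, unigram_counts: Counter) -> float:
    """Mean unigram log-probability under a toy language model: a stand-in
    for reference-free metrics, which score text without any reference."""
    total, vocab = sum(unigram_counts.values()), len(unigram_counts)
    tokens = hypothesis.lower().split()
    # Add-one smoothing so unseen tokens do not produce log(0).
    logprobs = [math.log((unigram_counts[t] + 1) / (total + vocab + 1)) for t in tokens]
    return sum(logprobs) / len(logprobs)

# The reference-based score needs a gold reference; the reference-free score
# only needs a "language model" (here, plain unigram counts).
lm_counts = Counter("the cat sat on the mat the dog sat on the rug".split())
print(reference_based_f1("the cat sat on the mat", "a cat was sitting on the mat"))
print(reference_free_logprob("the cat sat on the mat", lm_counts))
```

Real reference-free metrics replace the toy unigram model with a large pre-trained language model or a learned quality estimator, but the input requirements differ in exactly this way.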
Related papers
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
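Claims like the higher correlation with human judgment reported in the entry above are established through meta-evaluation: computing correlation coefficients between metric scores and human ratings. The sketch below uses made-up numbers and assumes scipy is available; it is not code from that paper.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical human ratings and metric scores for five system outputs.
human_scores  = [4.5, 3.0, 2.0, 4.0, 1.5]
metric_scores = [0.82, 0.61, 0.35, 0.74, 0.30]

pearson, _ = pearsonr(human_scores, metric_scores)    # linear correlation
spearman, _ = spearmanr(human_scores, metric_scores)  # rank correlation
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```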
- Reference-based Metrics Disprove Themselves in Question Generation [17.83616985138126]
We find that using human-written references cannot guarantee the effectiveness of reference-based metrics.
A good metric is expected to grade a human-validated question no worse than it grades generated questions.
We propose a reference-free metric consisting of multi-dimensional criteria such as naturalness, answerability, and complexity.
arXiv Detail & Related papers (2024-03-18T20:47:10Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method that enhances evaluation benchmarks by increasing the number and diversity of references.
We conduct experiments to empirically demonstrate that diversifying the expression of the references can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z)
- CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation [85.03709740727867]
We propose an unsupervised reference-free metric called CTRLEval to evaluate controlled text generation models.
CTRLEval assembles the generation probabilities from a pre-trained language model without any model training.
Experimental results show that our metric has higher correlations with human judgments than baseline metrics.
arXiv Detail & Related papers (2022-04-02T13:42:49Z)
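As a rough illustration of building a reference-free score from a pre-trained language model's generation probabilities (the general idea behind the entry above, though not CTRLEval's actual pattern-based formulation), one can compute an average token log-likelihood with an off-the-shelf causal LM. The sketch assumes the Hugging Face transformers library and GPT-2, neither of which is prescribed by the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: any causal LM would do; GPT-2 is used only as an example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def lm_score(text: str) -> float:
    """Average token log-likelihood under the LM (higher = more likely/fluent)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return -loss.item()

print(lm_score("The committee approved the proposal unanimously."))
print(lm_score("Approved proposal the unanimously committee the."))
```

No reference and no metric training are involved: the score comes entirely from the pre-trained model's probabilities.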
- Revisiting the Evaluation Metrics of Paraphrase Generation [35.6803390044542]
Most existing paraphrase generation models use reference-based metrics to evaluate their generated paraphrases.
This paper proposes BBScore, a reference-free metric that can reflect the generated paraphrase's quality.
arXiv Detail & Related papers (2022-02-17T07:18:54Z)
- Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards: bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
arXiv Detail & Related papers (2021-12-08T06:34:58Z)
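A linear ensemble of metrics of the kind reported for Billboards can be as simple as a least-squares fit of human ratings onto several metric scores. The sketch below shows the mechanics with numpy and entirely made-up numbers; it is not the paper's actual setup.

```python
import numpy as np

# Hypothetical scores from three diverse metrics (rows = outputs, cols = metrics)
# and the corresponding human ratings for the same outputs.
metric_scores = np.array([
    [0.80, 0.30, 0.65],
    [0.55, 0.45, 0.40],
    [0.20, 0.70, 0.35],
    [0.90, 0.25, 0.75],
    [0.35, 0.60, 0.30],
])
human_ratings = np.array([4.2, 3.1, 2.0, 4.6, 2.4])

# Least-squares fit of a linear combination (plus intercept) of the metrics.
X = np.hstack([metric_scores, np.ones((len(metric_scores), 1))])
weights, *_ = np.linalg.lstsq(X, human_ratings, rcond=None)
ensemble = X @ weights  # the ensemble metric's predicted quality scores

print("weights:", weights)
print("ensemble scores:", ensemble.round(2))
```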
- Language Model Augmented Relevance Score [2.8314622515446835]
Language Model Augmented Relevance Score (MARS) is a new context-aware metric for NLG evaluation.
MARS uses off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references.
arXiv Detail & Related papers (2021-08-19T03:59:23Z)
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predictions can help augment the reference set and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)