InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation
- URL: http://arxiv.org/abs/2112.01589v1
- Date: Thu, 2 Dec 2021 20:09:29 GMT
- Title: InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation
- Authors: Pierre Colombo, Chloe Clavel, Pablo Piantanida
- Abstract summary: InfoLM is a family of untrained metrics that can be viewed as string-based metrics backed by a pre-trained masked language model.
This family of metrics also makes use of information measures, allowing InfoLM to be adapted to various evaluation criteria.
- Score: 27.129551973093008
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assessing the quality of natural language generation systems through human
annotation is very expensive. Additionally, human annotation campaigns are
time-consuming and include non-reusable human labour. In practice, researchers
rely on automatic metrics as a proxy of quality. In the last decade, many
string-based metrics (e.g., BLEU) have been introduced. However, such metrics
usually rely on exact matches and thus, do not robustly handle synonyms. In
this paper, we introduce InfoLM, a family of untrained metrics that can be
viewed as a string-based metric that addresses the aforementioned flaws thanks
to a pre-trained masked language model. This family of metrics also makes use
of information measures allowing the adaptation of InfoLM to various evaluation
criteria. Using direct assessment, we demonstrate that InfoLM achieves
statistically significant improvement and over $10$ points of correlation gains
in many configurations on both summarization and data2text generation.
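To make the abstract concrete, here is a minimal, illustrative sketch of an InfoLM-style computation. It is not the authors' implementation: it assumes the Hugging Face `bert-base-uncased` checkpoint as the masked language model, averages each sentence's per-position vocabulary distributions into a single bag-of-token distribution, and compares candidate and reference with a symmetrised KL divergence, whereas the paper defines a whole family of information measures and its own aggregation scheme.

```python
# Minimal, illustrative sketch of an InfoLM-style score (NOT the authors' implementation).
# Assumptions: the Hugging Face "bert-base-uncased" checkpoint as the masked LM,
# mean-pooling of per-position vocabulary distributions, and a symmetrised KL
# divergence as the information measure; the paper studies a broader family.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()


def sentence_distribution(sentence: str) -> torch.Tensor:
    """Mask each token in turn and average the MLM's vocabulary distributions."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    dists = []
    with torch.no_grad():
        for pos in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = input_ids.clone()
            masked[pos] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, pos]
            dists.append(torch.softmax(logits, dim=-1))
    return torch.stack(dists).mean(dim=0)  # one distribution over the whole vocabulary


def symmetrised_kl(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-12) -> float:
    """One possible information measure between two vocabulary distributions."""
    p, q = p + eps, q + eps
    return 0.5 * ((p * (p / q).log()).sum() + (q * (q / p).log()).sum()).item()


candidate = "The cat sat on the mat."
reference = "A cat was sitting on the rug."
score = symmetrised_kl(sentence_distribution(candidate), sentence_distribution(reference))
print(f"InfoLM-style divergence (lower = more similar): {score:.4f}")
```

Lower divergence means the candidate and reference induce similar masked-LM distributions; because no metric-specific training is involved, the score is "untrained" in the sense used in the abstract, and synonyms are handled through the language model rather than exact string matches.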
Related papers
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- We Need to Talk About Classification Evaluation Metrics in NLP [34.73017509294468]
In Natural Language Processing (NLP), model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC.
The diversity of metrics, and the arbitrariness of their application suggest that there is no agreement within NLP on a single best metric to use.
We demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance.
arXiv Detail & Related papers (2024-01-08T11:40:48Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous substring of the claim that is supported by the context; a plain longest-common-subsequence sketch of this matching idea is included after this list.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- Improving Metrics for Speech Translation [1.2891210250935146]
We introduce Parallel Paraphrasing ($\text{Para}_\text{both}$), an augmentation method for translation metrics making use of automatic paraphrasing of both the reference and hypothesis.
We show that we are able to significantly improve the correlation with human quality perception if our method is applied to commonly used metrics.
arXiv Detail & Related papers (2023-05-22T11:01:38Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards, bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
arXiv Detail & Related papers (2021-12-08T06:34:58Z)
- BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation [16.81712151903078]
Natural language processing (NLP) systems are increasingly trained to generate open-ended text.
Different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others.
Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics) to make research into new metrics itself easier to evaluate.
arXiv Detail & Related papers (2021-10-18T10:03:19Z)
- Supervised Categorical Metric Learning with Schatten p-Norms [10.995886294197412]
We propose a method, called CPML for categorical projected metric learning, to address the problem of metric learning in categorical data.
We make use of the Value Distance Metric to represent our data and propose new distances based on this representation.
We then show how to efficiently learn new metrics.
arXiv Detail & Related papers (2020-02-26T01:17:12Z)
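For the "Longest Supported Subsequence" entry above, the sketch below illustrates the underlying matching idea with a plain token-level longest common subsequence under whitespace tokenization. It is a rough, untrained baseline, not the finetuned LSS model or the metric reported in that paper; `supported_fraction` is a hypothetical helper name introduced here for illustration.

```python
# Token-level longest common subsequence (LCS) as a rough proxy for the
# "longest supported subsequence" idea described above. This is a plain
# string-matching baseline, NOT the finetuned LSS model from the paper.
def lcs_length(claim_tokens: list[str], context_tokens: list[str]) -> int:
    """Classic O(n*m) dynamic programming for the longest common subsequence."""
    n, m = len(claim_tokens), len(context_tokens)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if claim_tokens[i - 1] == context_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]


def supported_fraction(claim: str, context: str) -> float:
    """Fraction of claim tokens that appear, in order, somewhere in the context."""
    claim_toks, ctx_toks = claim.lower().split(), context.lower().split()
    return lcs_length(claim_toks, ctx_toks) / max(len(claim_toks), 1)


print(supported_fraction(
    "the meeting was moved to friday",
    "organisers announced the meeting was moved from monday to friday",
))  # -> 1.0: every claim token appears in the context, in the same order
```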