A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric
Evaluation -- through the Lens of Semantic Similarity Rating
- URL: http://arxiv.org/abs/2205.12176v1
- Date: Tue, 24 May 2022 16:19:32 GMT
- Title: A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric
Evaluation -- through the Lens of Semantic Similarity Rating
- Authors: Laura Zeidler, Juri Opitz and Anette Frank
- Abstract summary: We develop a CheckList for NLG metrics that is organized around meaning-relevant linguistic phenomena.
Each test instance consists of a pair of sentences with their AMR graphs and a human-produced textual semantic similarity or relatedness score.
We demonstrate the usefulness of CheckList by designing a new metric GraCo that computes lexical cohesion graphs over AMR concepts.
- Score: 19.33681537640272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating the quality of generated text is difficult, since traditional NLG
evaluation metrics, focusing more on surface form than meaning, often fail to
assign appropriate scores. This is especially problematic for AMR-to-text
evaluation, given the abstract nature of AMR. Our work aims to support the
development and improvement of NLG evaluation metrics that focus on meaning, by
developing a dynamic CheckList for NLG metrics that is interpreted by being
organized around meaning-relevant linguistic phenomena. Each test instance
consists of a pair of sentences with their AMR graphs and a human-produced
textual semantic similarity or relatedness score. Our CheckList facilitates
comparative evaluation of metrics and reveals strengths and weaknesses of novel
and traditional metrics. We demonstrate the usefulness of CheckList by
designing a new metric GraCo that computes lexical cohesion graphs over AMR
concepts. Our analysis suggests that GraCo presents an interesting NLG metric
worth future investigation and that meaning-oriented NLG metrics can profit
from graph-based metric components using AMR.
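As a concrete illustration of the setup the abstract describes, the sketch below models a CheckList test instance (a sentence pair with their AMR graphs, a linguistic phenomenon label, and a human similarity/relatedness rating) and compares a candidate metric against the human ratings per phenomenon via Spearman correlation. All class and function names are illustrative assumptions, not the authors' released code; a GraCo-like metric would plug in as the `metric` callable by building lexical cohesion graphs over the AMR concepts of the two sentences.

```python
# A hedged sketch of a CheckList-style test suite for meaning-oriented NLG
# metric evaluation; the data layout and the per-phenomenon Spearman
# analysis follow the abstract's description, but all names are
# illustrative, not the authors' implementation.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List

from scipy.stats import spearmanr


@dataclass
class TestInstance:
    sentence_a: str
    sentence_b: str
    amr_a: str            # AMR graph, e.g. in PENMAN notation (assumed format)
    amr_b: str
    human_score: float    # human semantic similarity / relatedness rating
    phenomenon: str       # e.g. "negation" or "synonymy" (illustrative labels)


def evaluate_metric(metric: Callable[[TestInstance], float],
                    suite: List[TestInstance]) -> Dict[str, float]:
    """Correlate a candidate metric with human ratings, per linguistic phenomenon."""
    by_phenomenon: Dict[str, List[TestInstance]] = defaultdict(list)
    for inst in suite:
        by_phenomenon[inst.phenomenon].append(inst)

    results: Dict[str, float] = {}
    for phenomenon, instances in by_phenomenon.items():
        metric_scores = [metric(inst) for inst in instances]
        human_scores = [inst.human_score for inst in instances]
        rho, _ = spearmanr(metric_scores, human_scores)
        results[phenomenon] = rho   # low rho flags a weakness on this phenomenon
    return results
```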
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed at WMT and assesses their impact on the metric rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Rematch: Robust and Efficient Matching of Local Knowledge Graphs to Improve Structural and Semantic Similarity [6.1980259703476674]
We introduce a novel AMR similarity metric, rematch, alongside a new evaluation for structural similarity called RARE.
Rematch ranks second in structural similarity and first in semantic similarity, by a margin of 1-5 percentage points on the STS-B and SICK-R benchmarks.
arXiv Detail & Related papers (2024-04-02T17:33:00Z)
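To make the notion of structural similarity between AMR graphs concrete, here is a minimal, hedged sketch that scores two graphs by the F1 overlap of their relation triples (Smatch-style, but without the variable-alignment search). This is not the rematch algorithm itself, and the toy graphs are illustrative only.

```python
# A hedged baseline for AMR structural similarity: F1 over shared relation
# triples. Real metrics such as Smatch or rematch are more involved (variable
# alignment, motif matching); this sketch only illustrates the underlying idea.
from typing import List, Tuple

Triple = Tuple[str, str, str]   # (source concept, relation, target concept)


def triple_f1(triples_a: List[Triple], triples_b: List[Triple]) -> float:
    a, b = set(triples_a), set(triples_b)
    if not a or not b:
        return 0.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)


# Toy example: "The boy wants to go" vs. "The boy wants to leave".
amr_go = [("want-01", ":ARG0", "boy"), ("want-01", ":ARG1", "go-02"), ("go-02", ":ARG0", "boy")]
amr_leave = [("want-01", ":ARG0", "boy"), ("want-01", ":ARG1", "leave-01"), ("leave-01", ":ARG0", "boy")]
print(triple_f1(amr_go, amr_leave))   # 1 of 3 triples matches -> F1 = 1/3
```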
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference? [3.2528685897001455]
This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference.
Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures.
To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages.
arXiv Detail & Related papers (2023-12-03T20:09:23Z)
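Reference-free fluency scoring of this kind typically reduces to language-model likelihood. The sketch below assumes that reading: `token_log_probs` is a hypothetical interface to any trained language model (for example the RNN LMs mentioned above), and the scoring functions are illustrative rather than the paper's implementation.

```python
# A hedged sketch of reference-free fluency scoring via language-model
# likelihood. `token_log_probs` is a hypothetical stand-in for a trained LM
# that returns one log-probability per token of the sentence.
import math
from typing import Callable, List

LogProbFn = Callable[[List[str]], List[float]]


def fluency_score(tokens: List[str], token_log_probs: LogProbFn) -> float:
    """Mean token log-probability: higher means the LM finds the sentence more fluent."""
    log_probs = token_log_probs(tokens)
    return sum(log_probs) / max(len(log_probs), 1)


def perplexity(tokens: List[str], token_log_probs: LogProbFn) -> float:
    """Equivalent view of the same quantity: lower perplexity = more fluent."""
    return math.exp(-fluency_score(tokens, token_log_probs))
```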
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
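The lever proposed above, adding references, slots directly into standard n-gram metrics, which already accept several references per hypothesis. A small hedged illustration with NLTK's sentence-level BLEU (not the paper's evaluation pipeline; the example sentences are made up):

```python
# A hedged illustration of multi-reference n-gram scoring with NLTK's BLEU.
# Clipped n-gram counts are taken against the best-matching reference, so
# extra references can only widen the set of acceptable phrasings.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

hypothesis = "the cat sat on the mat".split()
references = [
    "the cat is sitting on the mat".split(),
    "a cat sat on the mat".split(),   # second reference adds acceptable variation
]

smooth = SmoothingFunction().method1   # avoid zero scores on short sentences
single_ref = sentence_bleu(references[:1], hypothesis, smoothing_function=smooth)
multi_ref = sentence_bleu(references, hypothesis, smoothing_function=smooth)
print(f"1 reference:  {single_ref:.3f}")
print(f"2 references: {multi_ref:.3f}")   # typically at least as high as above
```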
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework that uses large language models with chain-of-thought (CoT) prompting and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
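The CoT-plus-form-filling recipe can be sketched as a prompt with explicit evaluation steps followed by a request for a single numeric rating. The wording below and the `call_llm` callable are illustrative assumptions, not G-Eval's released prompts or API (G-Eval additionally weights scores by output-token probabilities, which this sketch omits).

```python
# A hedged sketch of an LLM-as-evaluator in the G-Eval style: evaluation
# steps (chain of thought) plus a form-filling request for a 1-5 score.
# `call_llm` is a hypothetical chat-completion wrapper; the prompt wording
# is illustrative, not the paper's.
import re
from typing import Callable

PROMPT_TEMPLATE = """You will be given one summary written for a news article.
Your task is to rate the summary on coherence (1-5).

Evaluation steps:
1. Read the source article and identify its main points.
2. Check whether the summary covers those points in a clear, logical order.
3. Assign a coherence score from 1 (incoherent) to 5 (fully coherent).

Source article:
{source}

Summary:
{summary}

Coherence score (1-5):"""


def llm_rating(source: str, summary: str, call_llm: Callable[[str], str]) -> int:
    """Fill the form-style prompt, query the LLM, and parse the 1-5 rating."""
    response = call_llm(PROMPT_TEMPLATE.format(source=source, summary=summary))
    match = re.search(r"[1-5]", response)
    if match is None:
        raise ValueError(f"no rating found in LLM response: {response!r}")
    return int(match.group())
```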
- REV: Information-Theoretic Evaluation of Free-Text Rationales [83.24985872655738]
We argue that an ideal metric should focus on the new information uniquely provided in the rationale that is otherwise not provided in the input or the label.
We propose a metric called REV (Rationale Evaluation with conditional V-information) to quantify the amount of new, label-relevant information in a rationale.
arXiv Detail & Related papers (2022-10-10T19:31:30Z)
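One hedged way to read the conditional V-information idea above: credit the rationale only for label-relevant information beyond what a baseline (e.g. a vacuous rationale) already provides, estimated as a difference of conditional log-likelihoods under an evaluation model. The interface below is hypothetical and simplifies the paper's estimator.

```python
# A hedged sketch of a REV-style score. `log_p_label_given(x, r, y)` is a
# hypothetical interface returning log p(y | x, r) under an evaluation model;
# the score credits the rationale only for information beyond the baseline.
from typing import Callable

LogLikFn = Callable[[str, str, str], float]   # (input, rationale, label) -> log-prob


def rev_like_score(x: str, y: str, rationale: str, vacuous_rationale: str,
                   log_p_label_given: LogLikFn) -> float:
    """Higher = the rationale adds more new, label-relevant information."""
    return log_p_label_given(x, rationale, y) - log_p_label_given(x, vacuous_rationale, y)
```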
- Weisfeiler-Leman in the BAMBOO: Novel AMR Graph Metrics and a Benchmark for AMR Graph Similarity [12.375561840897742]
We propose new AMR similarity metrics that unify the strengths of previous metrics, while mitigating their weaknesses.
Specifically, our new metrics are able to match contextualized substructures and induce n:m alignments between their nodes.
We introduce a Benchmark for AMR Metrics based on Overt Objectives (BAMBOO) to support empirical assessment of graph-based MR similarity metrics.
arXiv Detail & Related papers (2021-08-26T17:58:54Z)
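As background for the Weisfeiler-Leman idea invoked above, the sketch below runs a few WL relabelling iterations (each node absorbs its neighbours' labels) and compares two graphs by the Dice overlap of the resulting label multisets. This is the generic WL-kernel view only; the paper's metrics additionally handle AMR specifics such as inverse roles and n:m node alignments, which the sketch omits.

```python
# A hedged, generic Weisfeiler-Leman style graph comparison: iterative node
# relabelling followed by Dice overlap of label multisets. Not the paper's
# exact metrics; it only illustrates the family of techniques.
from collections import Counter
from typing import Dict, List, Tuple

# A graph is (node -> concept label, list of (source, role, target) edges).
Graph = Tuple[Dict[str, str], List[Tuple[str, str, str]]]


def wl_label_multiset(graph: Graph, iterations: int = 2) -> Counter:
    labels, edges = graph
    labels = dict(labels)
    seen = Counter(labels.values())
    for _ in range(iterations):
        new_labels = {}
        for node, label in labels.items():
            # Unlabelled targets (e.g. constants) fall back to their own name.
            neigh = sorted(f"{role}:{labels.get(tgt, tgt)}"
                           for src, role, tgt in edges if src == node)
            new_labels[node] = label + "|" + ",".join(neigh)  # neighbourhood signature
        labels = new_labels
        seen.update(labels.values())
    return seen


def wl_similarity(g1: Graph, g2: Graph, iterations: int = 2) -> float:
    """Dice overlap of WL label multisets; 1.0 for identical graphs."""
    c1, c2 = wl_label_multiset(g1, iterations), wl_label_multiset(g2, iterations)
    shared = sum((c1 & c2).values())
    total = sum(c1.values()) + sum(c2.values())
    return 2 * shared / total if total else 0.0
```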
- Language Model Augmented Relevance Score [2.8314622515446835]
Language Model Augmented Relevance Score (MARS) is a new context-aware metric for NLG evaluation.
MARS uses off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references.
arXiv Detail & Related papers (2021-08-19T03:59:23Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR [22.8438857884398]
AMR-to-text systems are typically evaluated using metrics that compare the generated texts to reference texts from which the input meaning representations were constructed.
We show that, besides well-known issues from which such metrics suffer, an additional problem arises when applying these metrics to AMR-to-text evaluation.
We further show that fulfilling both of the proposed principles offers benefits for AMR-to-text evaluation, including explainability of scores.
arXiv Detail & Related papers (2020-08-20T11:25:26Z)