Better Smatch = Better Parser? AMR evaluation is not so simple anymore
- URL: http://arxiv.org/abs/2210.06461v1
- Date: Wed, 12 Oct 2022 17:57:48 GMT
- Title: Better Smatch = Better Parser? AMR evaluation is not so simple anymore
- Authors: Juri Opitz and Anette Frank
- Abstract summary: We conduct an analysis of two popular and strong AMR parsers that reach quality levels on par with human IAA.
Considering high-performance parsers, better Smatch scores may not necessarily indicate consistently better parsing quality.
- Score: 22.8438857884398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, astonishing advances have been observed in AMR parsing, as measured
by the structural Smatch metric. In fact, today's systems achieve performance
levels that seem to surpass estimates of human inter-annotator agreement (IAA).
Therefore, it is unclear how well Smatch (still) relates to human estimates of
parse quality, as in this situation potentially fine-grained errors of similar
weight may impact the AMR's meaning to different degrees.
We conduct an analysis of two popular and strong AMR parsers that --
according to Smatch -- reach quality levels on par with human IAA, and assess
how human quality ratings relate to Smatch and other AMR metrics. Our main
findings are: i) While high Smatch scores indicate otherwise, we find that AMR
parsing is far from being solved: we frequently find structurally small, but
semantically unacceptable errors that substantially distort sentence meaning.
ii) Considering high-performance parsers, better Smatch scores may not
necessarily indicate consistently better parsing quality. To obtain a
meaningful and comprehensive assessment of quality differences of parse(r)s, we
recommend augmenting evaluations with macro statistics, use of additional
metrics, and more human analysis.
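To make the finding concrete, here is a minimal, illustrative Python sketch of the triple-overlap F1 idea behind Smatch, together with the micro- vs. macro-averaged corpus statistics the authors recommend reporting alongside it. It is a sketch under simplifying assumptions: the variables of gold and predicted graphs are taken as already aligned (real Smatch searches for the best variable mapping via hill climbing), and the example graph and frame labels are purely illustrative.

```python
# Simplified, illustrative Smatch-style scoring: an AMR graph is treated as a
# set of (source, relation, target) triples and two graphs are compared by
# exact triple overlap. NOTE: real Smatch first searches for the best mapping
# between the variables of the two graphs; here they are assumed aligned.

def triple_prf(gold: set, pred: set) -> tuple:
    """Precision, recall and F1 of the predicted triple set against gold."""
    matched = len(gold & pred)
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def corpus_scores(pairs):
    """Micro F1 (pool all triples) vs. macro F1 (mean of per-sentence F1)."""
    tp = gold_n = pred_n = 0
    per_sentence_f1 = []
    for gold, pred in pairs:
        tp += len(gold & pred)
        gold_n += len(gold)
        pred_n += len(pred)
        per_sentence_f1.append(triple_prf(gold, pred)[2])
    micro_p = tp / pred_n if pred_n else 0.0
    micro_r = tp / gold_n if gold_n else 0.0
    micro = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    macro = sum(per_sentence_f1) / len(per_sentence_f1) if per_sentence_f1 else 0.0
    return micro, macro

# "The boy does not want to go": dropping a single polarity triple reverses
# the sentence meaning but costs only one of seven triples.
gold = {("w", "instance", "want-01"), ("b", "instance", "boy"),
        ("g", "instance", "go-02"), ("w", "ARG0", "b"),
        ("w", "ARG1", "g"), ("g", "ARG0", "b"),
        ("w", "polarity", "-")}
pred = gold - {("w", "polarity", "-")}   # parser drops the negation
print(triple_prf(gold, pred))            # ~0.92 F1, yet the meaning is reversed
print(corpus_scores([(gold, pred)]))
```

In this toy example a structurally tiny error (one missing triple) leaves the score above 0.9, which is exactly the kind of semantically unacceptable but Smatch-cheap mistake the paper highlights.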
Related papers
- Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge [51.93909886542317]
We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human behavior and automatic evaluation methods.
We propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z) - Rematch: Robust and Efficient Matching of Local Knowledge Graphs to Improve Structural and Semantic Similarity [6.1980259703476674]
We introduce a novel AMR similarity metric, rematch, alongside a new evaluation for structural similarity called RARE.
Rematch ranks second in structural similarity and first in semantic similarity, by 1--5 percentage points on the STS-B and SICK-R benchmarks.
arXiv Detail & Related papers (2024-04-02T17:33:00Z) - AMR Parsing is Far from Solved: GrAPES, the Granular AMR Parsing
Evaluation Suite [18.674172788583967]
We present the Granular AMR Parsing Evaluation Suite (GrAPES).
GrAPES reveals in depth the abilities and shortcomings of current AMR parsers.
arXiv Detail & Related papers (2023-12-06T13:19:56Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited
Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with
Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - Retrofitting Multilingual Sentence Embeddings with Abstract Meaning
Representation [70.58243648754507]
We introduce a new method to improve existing multilingual sentence embeddings with Abstract Meaning Representation (AMR).
Compared with the original textual input, AMR is a structured semantic representation that presents the core concepts and relations in a sentence explicitly and unambiguously.
Experiment results show that retrofitting multilingual sentence embeddings with AMR leads to better state-of-the-art performance on both semantic similarity and transfer tasks.
arXiv Detail & Related papers (2022-10-18T11:37:36Z) - SBERT studies Meaning Representations: Decomposing Sentence Embeddings
into Explainable AMR Meaning Features [22.8438857884398]
We create similarity metrics that are highly effective, while also providing an interpretable rationale for their rating.
Our approach works in two steps: We first select AMR graph metrics that measure meaning similarity of sentences with respect to key semantic facets.
Second, we employ these metrics to induce Semantically Structured Sentence BERT embeddings, which are composed of different meaning aspects captured in different sub-spaces.
arXiv Detail & Related papers (2022-06-14T17:37:18Z) - Re-Examining System-Level Correlations of Automatic Summarization
Evaluation Metrics [64.81682222169113]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z) - Probabilistic, Structure-Aware Algorithms for Improved Variety,
Accuracy, and Coverage of AMR Alignments [9.74672460306765]
We present algorithms for aligning components of Abstract Meaning Representation (AMR) graphs to spans in English sentences.
We leverage unsupervised learning in combination with heuristics, taking the best of both worlds from previous AMR aligners.
Our approach covers a wider variety of AMR substructures than previously considered, achieves higher coverage of nodes and edges, and does so with higher accuracy.
arXiv Detail & Related papers (2021-06-10T18:46:32Z) - A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z) - AMR Similarity Metrics from Principles [21.915057426589748]
We establish criteria that enable researchers to perform a principled assessment of metrics comparing meaning representations like AMR.
We propose a novel metric S$^2$match that is more benevolent to only very slight meaning deviations and targets the fulfilment of all established criteria.
arXiv Detail & Related papers (2020-01-29T16:19:44Z)
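As a companion to the sketch above, the following snippet illustrates the idea behind S$^2$match from the last entry: exact matching of concept labels is relaxed into graded similarity, so very slight meaning deviations receive partial credit. The similarity lookup is a hypothetical placeholder (the actual metric uses cosine similarity over pretrained embeddings and still optimizes the variable alignment, both omitted here), so read it as a sketch of the mechanism rather than the metric itself.

```python
# Illustrative S^2match-style scoring: instance triples with *similar* (not
# identical) concept labels earn graded credit instead of zero credit.
# The similarity function below is a toy stand-in for an embedding-based one.

def concept_similarity(a: str, b: str) -> float:
    """Placeholder for an embedding-based similarity in [0, 1]."""
    if a == b:
        return 1.0
    toy_scores = {frozenset({"go-02", "leave-01"}): 0.8}   # assumed value
    return toy_scores.get(frozenset({a, b}), 0.0)

def soft_f1(gold: set, pred: set, threshold: float = 0.5) -> float:
    """F1 where each gold triple may softly match one predicted triple."""
    unused = set(pred)
    credit = 0.0
    for g_src, g_rel, g_tgt in gold:
        best, best_score = None, 0.0
        for cand in unused:
            p_src, p_rel, p_tgt = cand
            if (p_src, p_rel) != (g_src, g_rel):
                continue
            if g_rel == "instance":
                score = concept_similarity(g_tgt, p_tgt)
            else:
                score = float(g_tgt == p_tgt)
            if score > best_score:
                best, best_score = cand, score
        if best is not None and best_score >= threshold:
            credit += best_score
            unused.discard(best)
    p = credit / len(pred) if pred else 0.0
    r = credit / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# "go" vs. "leave" is only a slight meaning deviation, so it is rewarded with
# partial credit rather than being treated as a completely wrong concept.
gold = {("b", "instance", "boy"), ("g", "instance", "go-02"), ("g", "ARG0", "b")}
pred = {("b", "instance", "boy"), ("g", "instance", "leave-01"), ("g", "ARG0", "b")}
print(soft_f1(gold, pred))   # ~0.93, vs. ~0.67 under exact triple matching
```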
This list is automatically generated from the titles and abstracts of the papers in this site.