Weisfeiler-Leman in the BAMBOO: Novel AMR Graph Metrics and a Benchmark
for AMR Graph Similarity
- URL: http://arxiv.org/abs/2108.11949v1
- Date: Thu, 26 Aug 2021 17:58:54 GMT
- Title: Weisfeiler-Leman in the BAMBOO: Novel AMR Graph Metrics and a Benchmark
for AMR Graph Similarity
- Authors: Juri Opitz and Angel Daza and Anette Frank
- Abstract summary: We propose new AMR similarity metrics that unify the strengths of previous metrics, while mitigating their weaknesses.
Specifically, our new metrics are able to match contextualized substructures and induce n:m alignments between their nodes.
We introduce a Benchmark for AMR Metrics based on Overt Objectives (BAMBOO) to support empirical assessment of graph-based MR similarity metrics.
- Score: 12.375561840897742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several metrics have been proposed for assessing the similarity of (abstract)
meaning representations (AMRs), but little is known about how they relate to
human similarity ratings. Moreover, the current metrics have complementary
strengths and weaknesses: some emphasize speed, while others make the alignment
of graph structures explicit, at the price of a costly alignment step.
In this work we propose new Weisfeiler-Leman AMR similarity metrics that
unify the strengths of previous metrics, while mitigating their weaknesses.
Specifically, our new metrics are able to match contextualized substructures
and induce n:m alignments between their nodes. Furthermore, we introduce a
Benchmark for AMR Metrics based on Overt Objectives (BAMBOO), the first
benchmark to support empirical assessment of graph-based MR similarity metrics.
BAMBOO maximizes the interpretability of results by defining multiple overt
objectives that range from sentence similarity objectives to stress tests that
probe a metric's robustness against meaning-altering and meaning-preserving
graph transformations. We show the benefits of BAMBOO by profiling previous
metrics and our own metrics. Results indicate that our novel metrics may serve
as a strong baseline for future work.
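To make the Weisfeiler-Leman (WL) idea behind the proposed metrics more concrete, below is a minimal Python sketch of WL-style graph similarity: node labels are iteratively enriched with the labels and roles of their neighbors, and two graphs are compared via the overlap of the resulting contextualized label multisets. The graph encoding, function names, and the simple histogram overlap are illustrative assumptions, not the paper's implementation; the paper's metrics additionally induce n:m alignments between nodes, which this sketch does not attempt.

```python
# A minimal sketch of a Weisfeiler-Leman-style similarity between two labeled
# graphs (e.g., AMRs stored as node-label dicts plus directed, labeled edges).
# Illustration of the general WL idea only; not the exact metrics of the paper.
from collections import Counter

def wl_features(node_labels, edges, iterations=2):
    """Return a multiset (Counter) of contextualized node labels.

    node_labels: dict node_id -> concept label (e.g., {"a": "ask-01"})
    edges:       list of (source, role, target) triples (e.g., ("a", ":ARG0", "b"))
    """
    labels = dict(node_labels)
    features = Counter(labels.values())          # 0-th iteration: raw concepts
    for _ in range(iterations):
        new_labels = {}
        for node, label in labels.items():
            # Aggregate neighbor labels together with edge roles into a new
            # "contextualized" label for this node.
            context = sorted(f"{role}:{labels[tgt]}" for src, role, tgt in edges if src == node)
            context += sorted(f"inv-{role}:{labels[src]}" for src, role, tgt in edges if tgt == node)
            new_labels[node] = label + "|" + ";".join(context)
        labels = new_labels
        features.update(labels.values())
    return features

def wl_similarity(graph_a, graph_b, iterations=2):
    """Histogram overlap of WL features, normalized to [0, 1]."""
    fa = wl_features(*graph_a, iterations)
    fb = wl_features(*graph_b, iterations)
    overlap = sum((fa & fb).values())
    total = max(sum(fa.values()), sum(fb.values()))
    return overlap / total if total else 1.0

# Tiny usage example: two AMR-like graphs for "the cat sleeps",
# differing only in their variable names.
g1 = ({"s": "sleep-01", "c": "cat"}, [("s", ":ARG0", "c")])
g2 = ({"x": "sleep-01", "y": "cat"}, [("x", ":ARG0", "y")])
print(wl_similarity(g1, g2))  # 1.0: identical structure up to node names
```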
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Rematch: Robust and Efficient Matching of Local Knowledge Graphs to Improve Structural and Semantic Similarity [6.1980259703476674]
We introduce a novel AMR similarity metric, rematch, alongside a new evaluation for structural similarity called RARE.
Rematch ranks second in structural similarity and first in semantic similarity, leading by 1--5 percentage points on the STS-B and SICK-R benchmarks.
arXiv Detail & Related papers (2024-04-02T17:33:00Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences; the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
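As context for the multiple-reference idea, here is a hedged sketch of how n-gram matching metrics in the BLEU family can exploit several references: a candidate n-gram receives credit up to the highest count found in any reference. Function names and the omission of the brevity penalty and smoothing are simplifications, not the cited paper's implementation.

```python
# Simplified multi-reference clipped n-gram precision (BLEU-style clipping).
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n=2):
    cand = ngrams(candidate, n)
    # For each n-gram, the candidate gets credit up to its best count in any reference.
    max_ref = Counter()
    for ref in references:
        max_ref |= ngrams(ref, n)            # element-wise maximum of counts
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

cand = "the cat sat on the mat".split()
refs = ["a cat sat on the mat".split(), "the cat was sitting on a mat".split()]
print(clipped_precision(cand, refs, n=2))    # higher than with either reference alone
```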
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- Joint Metrics Matter: A Better Standard for Trajectory Forecasting [67.1375677218281]
Multi-modal trajectory forecasting methods are commonly evaluated using single-agent metrics (marginal metrics).
Only focusing on marginal metrics can lead to unnatural predictions, such as colliding trajectories or diverging trajectories for people who are clearly walking together as a group.
We present the first comprehensive evaluation of state-of-the-art trajectory forecasting methods with respect to multi-agent metrics (joint metrics): JADE, JFDE, and collision rate.
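To illustrate the difference between marginal and joint evaluation, the sketch below contrasts a marginal ADE (best sample chosen independently per agent) with a JADE-style joint ADE (one sample must explain the whole scene). Array shapes and function names are assumptions, not the cited paper's reference code.

```python
# Marginal vs. joint average displacement error for multi-agent forecasts.
import numpy as np

def displacement(pred, gt):
    """Per-sample, per-agent, per-timestep L2 error.

    pred: (K, A, T, 2) K predicted samples for A agents over T timesteps
    gt:   (A, T, 2)    ground-truth trajectories
    """
    return np.linalg.norm(pred - gt[None], axis=-1)    # (K, A, T)

def marginal_ade(pred, gt):
    # Best sample chosen independently per agent: can mix incompatible futures.
    err = displacement(pred, gt).mean(axis=-1)          # (K, A)
    return err.min(axis=0).mean()

def joint_ade(pred, gt):
    # One sample index must fit the whole scene: agents are judged jointly.
    err = displacement(pred, gt).mean(axis=(-1, -2))    # (K,)
    return err.min()

# Toy scene: 2 agents, 3 timesteps, 2 samples; each sample fits only one agent.
gt = np.zeros((2, 3, 2))
pred = np.stack([
    np.stack([np.zeros((3, 2)), np.ones((3, 2))]),      # sample 0: agent 0 perfect
    np.stack([np.ones((3, 2)), np.zeros((3, 2))]),      # sample 1: agent 1 perfect
])
print(marginal_ade(pred, gt))  # 0.0   (each agent picks its own best sample)
print(joint_ade(pred, gt))     # ~0.71 (no single sample fits both agents)
```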
arXiv Detail & Related papers (2023-05-10T16:27:55Z)
- MENLI: Robust Evaluation Metrics from Natural Language Inference [26.53850343633923]
Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks.
We develop evaluation metrics based on Natural Language Inference (NLI).
We show that our NLI-based metrics are much more robust to these attacks than the recent BERT-based metrics.
arXiv Detail & Related papers (2022-08-15T16:30:14Z)
- SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable AMR Meaning Features [22.8438857884398]
We create similarity metrics that are highly effective, while also providing an interpretable rationale for their rating.
Our approach works in two steps: We first select AMR graph metrics that measure meaning similarity of sentences with respect to key semantic facets.
Second, we employ these metrics to induce Semantically Structured Sentence BERT embeddings, which are composed of different meaning aspects captured in different sub-spaces.
arXiv Detail & Related papers (2022-06-14T17:37:18Z)
- A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs [19.822126244784133]
The link prediction task on knowledge graphs, which lacks explicit negative triples, motivates the use of rank-based metrics.
We introduce a simple theoretical framework for rank-based metrics, upon which we investigate two avenues for improving existing metrics via alternative aggregation functions and concepts from probability theory.
We propose several new rank-based metrics that are more easily interpreted and compared, accompanied by a demonstration of their usage in a benchmarking of knowledge graph embedding models.
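For reference, the following sketch computes common rank-based link-prediction metrics (mean rank, MRR, Hits@k) together with a simple expectation-adjusted mean rank. The adjusted variant only illustrates the general idea of aggregation that is comparable across datasets; it is not the cited paper's exact proposal.

```python
# Standard rank-based metrics plus a simple expectation-adjusted mean rank.
import numpy as np

def rank_metrics(ranks, num_candidates, ks=(1, 3, 10)):
    """ranks: 1-based rank of the true entity for each test triple."""
    ranks = np.asarray(ranks, dtype=float)
    out = {
        "mean_rank": ranks.mean(),
        "mrr": (1.0 / ranks).mean(),                 # harmonic aggregation
    }
    for k in ks:
        out[f"hits@{k}"] = (ranks <= k).mean()
    # A uniformly random scorer has expected rank (num_candidates + 1) / 2;
    # dividing by it makes the score comparable across candidate-set sizes.
    out["adjusted_mean_rank"] = ranks.mean() / ((num_candidates + 1) / 2)
    return out

print(rank_metrics([1, 2, 50], num_candidates=1000))
```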
arXiv Detail & Related papers (2022-03-14T23:09:46Z)
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
- AMR Similarity Metrics from Principles [21.915057426589748]
We establish criteria that enable researchers to perform a principled assessment of metrics comparing meaning representations like AMR.
We propose a novel metric S$^2$match that is more benevolent to only very slight meaning deviations and targets the fulfilment of all established criteria.
arXiv Detail & Related papers (2020-01-29T16:19:44Z)