Let's Stop Incorrect Comparisons in End-to-end Relation Extraction!
- URL: http://arxiv.org/abs/2009.10684v3
- Date: Mon, 9 Aug 2021 12:43:27 GMT
- Title: Let's Stop Incorrect Comparisons in End-to-end Relation Extraction!
- Authors: Bruno Taillé, Vincent Guigue, Geoffrey Scoutheeten and Patrick Gallinari
- Abstract summary: We first identify several patterns of invalid comparisons in published papers and describe them to avoid their propagation.
We then propose a small empirical study to quantify the impact of the most common mistake and show that it leads to overestimating the final RE performance by around 5% on ACE05.
- Score: 13.207968737733196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite efforts to distinguish three different evaluation setups (Bekoulis et
al., 2018), numerous end-to-end Relation Extraction (RE) articles present
unreliable performance comparison to previous work. In this paper, we first
identify several patterns of invalid comparisons in published papers and
describe them to avoid their propagation. We then propose a small empirical
study to quantify the impact of the most common mistake and show that it leads
to overestimating the final RE performance by around 5% on ACE05. We also seize
this opportunity to study the unexplored ablations of two recent developments:
the use of language model pretraining (specifically BERT) and span-level NER.
This meta-analysis emphasizes the need for rigor in reporting both the
evaluation setting and the dataset statistics, and we call for unifying the
evaluation setting in end-to-end RE.
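To illustrate the kind of discrepancy the paper targets, here is a minimal sketch contrasting a "strict" relation match (entity spans and types must agree) with a "boundaries" match (only spans must agree), in the spirit of the setups attributed to Bekoulis et al. (2018). The data structures, function names, and scores are illustrative assumptions, not taken from the paper.

```python
# Sketch of two end-to-end RE evaluation setups (assumed data layout for illustration):
# an entity is (start, end, type); a relation is (head_entity, tail_entity, label).

def strict_match(pred_rel, gold_rel):
    """Strict setup: argument spans, entity types, and relation label must all match."""
    return pred_rel == gold_rel

def boundaries_match(pred_rel, gold_rel):
    """Boundaries setup: argument spans and relation label must match; entity types are ignored."""
    (ph, pt, pl), (gh, gt, gl) = pred_rel, gold_rel
    return ph[:2] == gh[:2] and pt[:2] == gt[:2] and pl == gl

def f1(preds, golds, match):
    """Simplified micro F1: a prediction counts as correct if it matches any gold relation."""
    tp = sum(any(match(p, g) for g in golds) for p in preds)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: the predicted head entity has the correct span but the wrong entity type.
gold = [((0, 2, "PER"), (5, 6, "ORG"), "WORKS_FOR")]
pred = [((0, 2, "ORG"), (5, 6, "ORG"), "WORKS_FOR")]
print(f1(pred, gold, strict_match))      # 0.0 under the strict setup
print(f1(pred, gold, boundaries_match))  # 1.0 under the boundaries setup
```

Reporting a score computed under one matching criterion against a baseline computed under another is the kind of invalid comparison the paper catalogues.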
Related papers
- Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models [65.8478860180793]
Event extraction has gained extensive research attention due to its broad range of applications.
Current evaluation methods for event extraction rely on token-level exact match.
We propose a reliable and semantic evaluation framework for event extraction, named RAEE.
arXiv Detail & Related papers (2024-10-12T07:54:01Z)
- Revisiting Relation Extraction in the era of Large Language Models [24.33660998599006]
Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text.
Recent work has instead treated the problem as a sequence-to-sequence task, linearizing relations between entities as target strings to be generated conditioned on the input.
Here we push the limits of this approach, using larger language models (GPT-3 and Flan-T5 large) than considered in prior work and evaluating their performance on standard RE tasks under varying levels of supervision.
arXiv Detail & Related papers (2023-05-08T19:19:07Z)
- Accounting for multiplicity in machine learning benchmark performance [0.0]
Using the highest-ranked performance as an estimate for state-of-the-art (SOTA) performance is a biased estimator, giving overly optimistic results.
In this article, we provide a probability distribution for the case of multiple classifiers so that known analyses methods can be engaged and a better SOTA estimate can be provided.
arXiv Detail & Related papers (2023-03-10T10:32:18Z)
- Enriching Relation Extraction with OpenIE [70.52564277675056]
Relation extraction (RE) is a sub-discipline of information extraction (IE).
In this work, we explore how recent approaches for open information extraction (OpenIE) may help to improve the task of RE.
Our experiments over two annotated corpora, KnowledgeNet and FewRel, demonstrate the improved accuracy of our enriched models.
arXiv Detail & Related papers (2022-12-19T11:26:23Z)
- What do You Mean by Relation Extraction? A Survey on Datasets and Study on Scientific Relation Classification [21.513743126525622]
We present an empirical study on scientific Relation Classification across two datasets.
Despite large data overlap, our analysis reveals substantial discrepancies in annotation.
Variation within further sub-domains exists but impacts Relation Classification only to a limited degree.
arXiv Detail & Related papers (2022-04-28T14:07:25Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, and there has been no significant update in the past decade.
This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z)
- Impact of Pretraining Term Frequencies on Few-Shot Reasoning [51.990349528930125]
We investigate how well pretrained language models reason with terms that are less frequent in the pretraining data.
We measure the strength of this correlation for a number of GPT-based language models on various numerical deduction tasks.
Although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data.
arXiv Detail & Related papers (2022-02-15T05:43:54Z)
- Reenvisioning Collaborative Filtering vs Matrix Factorization [65.74881520196762]
Collaborative filtering models based on matrix factorization and learned similarities using Artificial Neural Networks (ANNs) have gained significant attention in recent years.
The use of ANNs within the recommendation ecosystem has recently been questioned, raising several comparisons in terms of efficiency and effectiveness.
We show the potential these techniques may have for beyond-accuracy evaluation while analyzing their effect on complementary evaluation dimensions.
arXiv Detail & Related papers (2021-07-28T16:29:38Z)
- Re-TACRED: Addressing Shortcomings of the TACRED Dataset [5.820381428297218]
TACRED is one of the largest and most widely used sentence-level relation extraction datasets.
Proposed models that are evaluated using this dataset consistently set new state-of-the-art performance.
However, they still exhibit large error rates despite leveraging external knowledge and unsupervised pretraining on large text corpora.
arXiv Detail & Related papers (2021-04-16T22:55:11Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.