A Fair and In-Depth Evaluation of Existing End-to-End Entity Linking Systems
- URL: http://arxiv.org/abs/2305.14937v2
- Date: Fri, 17 Nov 2023 15:28:00 GMT
- Title: A Fair and In-Depth Evaluation of Existing End-to-End Entity Linking Systems
- Authors: Hannah Bast and Matthias Hertel and Natalie Prange
- Abstract summary: Existing evaluations of entity linking systems often say little about how the system is going to perform for a particular application.
We provide a more meaningful and fair in-depth evaluation of a variety of existing end-to-end entity linkers.
Our evaluation is based on several widely used benchmarks, which exhibit the problems mentioned above to various degrees, as well as on two new benchmarks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing evaluations of entity linking systems often say little about how the
system is going to perform for a particular application. There are two
fundamental reasons for this. One is that many evaluations only use aggregate
measures (like precision, recall, and F1 score), without a detailed error
analysis or a closer look at the results. The other is that all of the widely
used benchmarks have strong biases and artifacts, in particular: a strong focus
on named entities, an unclear or missing specification of what else counts as
an entity mention, poor handling of ambiguities, and an over- or
underrepresentation of certain kinds of entities.
We provide a more meaningful and fair in-depth evaluation of a variety of
existing end-to-end entity linkers. We characterize their strengths and
weaknesses and also report on reproducibility aspects. The detailed results of
our evaluation can be inspected under
https://elevant.cs.uni-freiburg.de/emnlp2023 . Our evaluation is based on
several widely used benchmarks, which exhibit the problems mentioned above to
various degrees, as well as on two new benchmarks, which address the problems
mentioned above. The new benchmarks can be found under
https://github.com/ad-freiburg/fair-entity-linking-benchmarks .
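The aggregate measures criticized in the abstract (precision, recall, and F1 score) are typically computed over mention-entity annotations. The following is a minimal sketch of how such aggregate numbers arise, assuming gold and predicted annotations are (start, end, entity_id) spans; the function and example names are illustrative and not taken from the paper's evaluation tooling.

```python
# Minimal sketch of aggregate entity-linking measures (precision, recall, F1).
# Assumes gold and predicted annotations are sets of (start, end, entity_id)
# tuples for one document; names are illustrative, not from the paper's tooling.

def precision_recall_f1(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # exact span and entity must match
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {(0, 6, "Q64"), (25, 32, "Q183")}        # two gold mentions
    predicted = {(0, 6, "Q64"), (25, 32, "Q1206")}  # second entity is wrong
    print(precision_recall_f1(gold, predicted))     # (0.5, 0.5, 0.5)
```

Exactly such single numbers are what the paper argues say little on their own, without a detailed breakdown by error category.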
Related papers
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Revisiting Evaluation Metrics for Semantic Segmentation: Optimization and Evaluation of Fine-grained Intersection over Union [113.20223082664681]
We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects.
arXiv Detail & Related papers (2023-10-30T03:45:15Z)
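For context on the metrics discussed in this entry: classic mIoU averages one IoU per class computed over all pixels of the dataset. The sketch below, assuming integer label maps and generic helper names (it is not the authors' implementation), shows the standard computation that the fine-grained variants refine.

```python
# Rough illustration of per-class IoU and mean IoU (mIoU). Assumes `pred` and
# `gt` are integer label maps of the same shape with class ids
# 0..num_classes-1; generic sketch, not the authors' code.
import numpy as np

def per_class_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(intersection / union if union > 0 else np.nan)  # class absent
    return np.array(ious)

def dataset_miou(pred, gt, num_classes):
    # Classic mIoU: one IoU per class over all pixels, averaged over classes.
    return float(np.nanmean(per_class_iou(pred, gt, num_classes)))

if __name__ == "__main__":
    pred = np.array([[0, 0], [1, 2]])
    gt = np.array([[0, 1], [1, 2]])
    print(dataset_miou(pred, gt, num_classes=3))  # ~0.667
```

A fine-grained variant in the spirit of the paper averages IoUs computed per image (or per object) rather than once over the whole dataset, which is what reduces the dominance of large objects.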
- DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection [55.70982767084996]
A critical yet frequently overlooked challenge in the field of deepfake detection is the lack of a standardized, unified, comprehensive benchmark.
We present the first comprehensive benchmark for deepfake detection, called DeepfakeBench, which offers three key contributions.
DeepfakeBench contains 15 state-of-the-art detection methods, 9 deepfake datasets, a series of deepfake detection evaluation protocols and analysis tools, as well as comprehensive evaluations.
arXiv Detail & Related papers (2023-07-04T01:34:41Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- Entity Disambiguation with Entity Definitions [50.01142092276296]
Local models have recently attained astounding performance in Entity Disambiguation (ED).
Previous works limited their studies to using, as the textual representation of each candidate, only its Wikipedia title.
In this paper, we address this limitation and investigate to what extent more expressive textual representations can mitigate it.
We report a new state of the art on 2 out of 6 benchmarks we consider and strongly improve the generalization capability over unseen patterns.
arXiv Detail & Related papers (2022-10-11T17:46:28Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Robustness Evaluation of Entity Disambiguation Using Prior Probes: the Case of Entity Overshadowing [11.513083693564466]
We evaluate and report the performance of popular entity linking systems on the ShadowLink benchmark.
Results show a considerable difference in accuracy between more and less common entities for all of the EL systems under evaluation.
arXiv Detail & Related papers (2021-08-24T20:54:56Z)
- A Critical Assessment of State-of-the-Art in Entity Alignment [1.7725414095035827]
We investigate two state-of-the-art (SotA) methods for the task of Entity Alignment in Knowledge Graphs.
We first carefully examine the benchmarking process and identify several shortcomings, which make the results reported in the original works not always comparable.
arXiv Detail & Related papers (2020-10-30T15:09:19Z)
- Interpretable Meta-Measure for Model Performance [4.91155110560629]
We introduce a new meta-score assessment named Elo-based Predictive Power (EPP).
EPP is built on top of other performance measures and allows for interpretable comparisons of models.
We prove the mathematical properties of EPP and support them with empirical results of a large scale benchmark on 30 classification data sets and a real-world benchmark for visual data.
arXiv Detail & Related papers (2020-06-02T14:10:13Z)
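As background for the "Elo-based" naming in this entry, the following is a rough sketch of a standard Elo update between two models. It illustrates only the underlying rating idea, not the paper's actual EPP estimation procedure, and the constants are conventional defaults rather than values from the paper.

```python
# Standard Elo rating update, shown only as background for the "Elo-based"
# idea mentioned above; EPP itself is estimated differently. Initial rating
# 1000 and K=32 are conventional defaults, not values from the paper.
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """score_a is 1.0 if model A beats model B on a comparison, else 0.0."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# e.g. model A wins one head-to-head comparison against model B:
# elo_update(1000.0, 1000.0, 1.0) -> (1016.0, 984.0)
```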
- ESBM: An Entity Summarization BenchMark [20.293900908253544]
We create an Entity Summarization BenchMark (ESBM) which overcomes the limitations of existing benchmarks and meets standard desiderata for a benchmark.
Since all of these systems are unsupervised, we also implement and evaluate a supervised learning-based system for reference.
arXiv Detail & Related papers (2020-03-08T07:12:20Z)