Ranking vs. Classifying: Measuring Knowledge Base Completion Quality
- URL: http://arxiv.org/abs/2102.06145v1
- Date: Tue, 2 Feb 2021 17:53:48 GMT
- Title: Ranking vs. Classifying: Measuring Knowledge Base Completion Quality
- Authors: Marina Speranskaya, Martin Schmitt, Benjamin Roth
- Abstract summary: We argue that consideration of binary predictions is essential to reflect the actual KBC quality.
We simulate the realistic scenario of real-world entities missing from a KB.
We evaluate a number of state-of-the-art KB embeddings models on our new benchmark.
- Score: 10.06803520598035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge base completion (KBC) methods aim at inferring missing facts from
the information present in a knowledge base (KB) by estimating the likelihood
of candidate facts. In the prevailing evaluation paradigm, models do not
actually decide whether a new fact should be accepted or not but are solely
judged on the position of true facts in a likelihood ranking with other
candidates. We argue that consideration of binary predictions is essential to
reflect the actual KBC quality, and propose a novel evaluation paradigm,
designed to provide more transparent model selection criteria for a realistic
scenario. We construct the data set FB14k-QAQ where instead of single facts, we
use KB queries, i.e., facts where one entity is replaced with a variable, and
construct corresponding sets of entities that are correct answers. We randomly
remove some of these correct answers from the data set, simulating the
realistic scenario of real-world entities missing from a KB. This way, we can
explicitly measure a model's ability to handle queries that have more correct
answers in the real world than in the KB, including the special case of queries
without any valid answer. The latter case in particular contrasts with the ranking setting.
We evaluate a number of state-of-the-art KB embeddings models on our new
benchmark. The differences in relative performance between ranking-based and
classification-based evaluation that we observe in our experiments confirm our
hypothesis that good performance on the ranking task does not necessarily
translate to good performance on the actual completion task. Our results
motivate future work on KB embedding models with better prediction separability
and, as a first step in that direction, we propose a simple variant of TransE
that encourages thresholding and achieves a significant improvement in
classification F1 score relative to the original TransE.
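To make the contrast between the two evaluation paradigms concrete, here is a minimal Python sketch of ranking-based versus threshold-based (classification) evaluation for a TransE-style scorer. All names (score_transe, classify_answers, f1_for_query), the toy embeddings, and the threshold value are hypothetical illustrations of the general idea; they do not reproduce the authors' implementation or the FB14k-QAQ data format.

```python
import numpy as np

# Toy entity/relation embeddings (hypothetical); a real model learns these from the KB.
rng = np.random.default_rng(0)
entities = {e: rng.normal(size=16) for e in ["e1", "e2", "e3", "e4", "e5"]}
relations = {"born_in": rng.normal(size=16)}

def score_transe(h, r, t):
    """TransE plausibility score: higher (less negative) means more plausible."""
    return -np.linalg.norm(entities[h] + relations[r] - entities[t])

def rank_of(true_tail, h, r):
    """Ranking evaluation: position of one true answer among all candidate entities."""
    scores = {t: score_transe(h, r, t) for t in entities}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_tail) + 1

def classify_answers(h, r, threshold):
    """Classification evaluation: accept every candidate whose score clears a threshold.
    Unlike ranking, this can also return the empty set for queries with no valid answer."""
    return {t for t in entities if score_transe(h, r, t) >= threshold}

def f1_for_query(predicted, gold):
    """F1 between the predicted answer set and the gold answer set."""
    if not predicted and not gold:
        return 1.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# A query (e1, born_in, ?) whose gold answer set may be larger than what the KB records.
gold_answers = {"e2", "e4"}
print("rank of e2:", rank_of("e2", "e1", "born_in"))
print("F1 at threshold -5.0:",
      f1_for_query(classify_answers("e1", "born_in", threshold=-5.0), gold_answers))
```

The point of the paper is visible in this toy setup: the ranking call never has to commit to a cut-off, while the classification call must, so a model can rank true answers highly yet still produce a poor answer set once a decision threshold is required.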
Related papers
- When is an Embedding Model More Promising than Another? [33.540506562970776]
Embedders play a central role in machine learning, projecting any object into numerical representations that can be leveraged to perform various downstream tasks.
The evaluation of embedding models typically depends on domain-specific empirical approaches.
We present a unified approach to evaluate embedders, drawing upon the concepts of sufficiency and informativeness.
arXiv Detail & Related papers (2024-06-11T18:13:46Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
We further refine the robustness metric: a model is judged robust only if its performance is consistently accurate across the entire clique.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- Exploring Partial Knowledge Base Inference in Biomedical Entity Linking [0.4798394926736971]
We name this scenario partial knowledge base inference.
We construct benchmarks and observe a catastrophic degradation in EL performance caused by a dramatic drop in precision.
We propose two simple-and-effective redemption methods to combat the NIL issue with little computational overhead.
arXiv Detail & Related papers (2023-03-18T04:31:07Z)
- Uncertainty-based Network for Few-shot Image Classification [17.912365063048263]
We propose Uncertainty-Based Network, which models the uncertainty of classification results with the help of mutual information.
We show that Uncertainty-Based Network achieves classification accuracy comparable to state-of-the-art methods.
arXiv Detail & Related papers (2022-05-17T07:49:32Z)
- Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
arXiv Detail & Related papers (2021-09-19T19:05:00Z)
- Benchmarking Commonsense Knowledge Base Population with an Effective Evaluation Dataset [37.02104430195374]
Reasoning over commonsense knowledge bases (CSKB), whose elements are written in free-text form, is an important yet hard task in NLP.
We benchmark the CSKB population task with a new large-scale dataset.
We also propose a novel inductive commonsense reasoning model that reasons over graphs.
arXiv Detail & Related papers (2021-09-16T02:50:01Z)
- Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases [63.43418760818188]
We release a new large-scale, high-quality dataset with 64,331 questions, GrailQA.
We propose a novel BERT-based KBQA model.
The combination of our dataset and model enables us to thoroughly examine and demonstrate, for the first time, the key role of pre-trained contextual embeddings like BERT in the generalization of KBQA.
arXiv Detail & Related papers (2020-11-16T06:36:26Z)
- Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion [59.549664231655726]
A case-based reasoning (CBR) system solves a new problem by retrieving 'cases' that are similar to the given problem.
In this paper, we demonstrate that such a system is achievable for reasoning in knowledge bases (KBs).
Our approach predicts attributes for an entity by gathering reasoning paths from similar entities in the KB.
arXiv Detail & Related papers (2020-10-07T17:48:12Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.