Re-TACRED: Addressing Shortcomings of the TACRED Dataset
- URL: http://arxiv.org/abs/2104.08398v1
- Date: Fri, 16 Apr 2021 22:55:11 GMT
- Title: Re-TACRED: Addressing Shortcomings of the TACRED Dataset
- Authors: George Stoica, Emmanouil Antonios Platanios, Barnabás Póczos
- Abstract summary: TACRED is one of the largest and most widely used sentence-level relation extraction datasets.
Proposed models that are evaluated using this dataset consistently set new state-of-the-art performance.
However, they still exhibit large error rates despite leveraging external knowledge and unsupervised pretraining on large text corpora.
- Score: 5.820381428297218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: TACRED is one of the largest and most widely used sentence-level relation
extraction datasets. Proposed models that are evaluated using this dataset
consistently set new state-of-the-art performance. However, they still exhibit
large error rates despite leveraging external knowledge and unsupervised
pretraining on large text corpora. A recent study suggested that this may be
due to poor dataset quality. The study observed that over 50% of the most
challenging sentences from the development and test sets are incorrectly
labeled and account for an average drop of 8% in model F1 score.
However, this study was limited to a small biased sample of 5k (out of a total
of 106k) sentences, substantially restricting the generalizability and broader
implications of its findings. In this paper, we address these shortcomings by:
(i) performing a comprehensive study over the whole TACRED dataset, (ii)
proposing an improved crowdsourcing strategy and deploying it to re-annotate
the whole dataset, and (iii) performing a thorough analysis to understand how
correcting the TACRED annotations affects previously published results. After
verification, we observed that 23.9% of TACRED labels are incorrect. Moreover,
evaluating several models on our revised dataset yields an average F1-score
improvement of 14.3% and helps uncover significant relationships between the
different models (rather than simply offsetting or scaling their scores by a
constant factor). Finally, aside from our analysis we also release Re-TACRED, a
new completely re-annotated version of the TACRED dataset that can be used to
perform reliable evaluation of relation extraction models.
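The F1 numbers in the abstract presumably follow the standard TACRED scoring convention: micro-averaged precision, recall, and F1 over the relation labels, with no_relation treated as the negative class. The following is a minimal sketch of that convention, under the assumption that Re-TACRED is scored the same way; the function name and toy labels are illustrative and not taken from the paper or its released code.

```python
# Minimal sketch of TACRED-style scoring (an assumed convention, not the
# authors' released evaluation script): micro-averaged precision, recall, and
# F1 over relation labels, with "no_relation" as the negative class.

def score(gold, pred, negative_label="no_relation"):
    correct = guessed = actual = 0
    for g, p in zip(gold, pred):
        if p != negative_label:
            guessed += 1            # predicted a (positive) relation
            if p == g:
                correct += 1        # and it matches the gold relation
        if g != negative_label:
            actual += 1             # gold says a relation is present
    precision = correct / guessed if guessed else 0.0
    recall = correct / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage with hypothetical labels:
gold = ["per:title", "no_relation", "org:founded_by", "per:title"]
pred = ["per:title", "per:title", "no_relation", "per:title"]
print(score(gold, pred))  # -> roughly (0.667, 0.667, 0.667)
```

Under such a convention, mislabeled gold relations distort both precision and recall, which is one way a 23.9% label error rate can translate into double-digit F1 shifts.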
Related papers
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
The robustness metric is refined accordingly: a model is judged robust if its performance is consistently accurate across the cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- Unbiased Supervised Contrastive Learning [10.728852691100338]
In this work, we tackle the problem of learning representations that are robust to biases.
We first present a margin-based theoretical framework that allows us to clarify why recent contrastive losses can fail when dealing with biased data.
We derive a novel formulation of the supervised contrastive loss (epsilon-SupInfoNCE), providing more accurate control of the minimal distance between positive and negative samples.
Thanks to our theoretical framework, we also propose FairKL, a new debiasing regularization loss that works well even with extremely biased data.
arXiv Detail & Related papers (2022-11-10T13:44:57Z)
- Assaying Out-Of-Distribution Generalization in Transfer Learning [103.57862972967273]
We take a unified view of previous work, highlighting message discrepancies that we address empirically.
We fine-tune over 31k networks from nine different architectures in the many- and few-shot setting.
arXiv Detail & Related papers (2022-07-19T12:52:33Z)
- CrossAug: A Contrastive Data Augmentation Method for Debiasing Fact Verification Models [14.75693099720436]
We propose CrossAug, a contrastive data augmentation method for debiasing fact verification models.
We employ a two-stage augmentation pipeline to generate new claims and evidence from existing samples.
The generated samples are then paired cross-wise with the original pair, forming contrastive samples that help the model rely less on spurious patterns.
arXiv Detail & Related papers (2021-09-30T13:19:19Z)
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models [41.45240621979654]
We introduce BEIR, a heterogeneous benchmark for information retrieval.
We study the effectiveness of nine state-of-the-art retrieval models in a zero-shot evaluation setup.
Dense-retrieval models are computationally more efficient but often underperform other approaches.
arXiv Detail & Related papers (2021-04-17T23:29:55Z)
- Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift [81.74795324629712]
We study prediction-time batch normalization, which normalizes each test batch with its own statistics and significantly improves model accuracy and calibration under covariate shift (see the sketch after this list).
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
arXiv Detail & Related papers (2020-06-19T05:08:43Z)
- Identifying Statistical Bias in Dataset Replication [102.92137353938388]
We study a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy.
After correcting for the identified statistical bias, only an estimated 3.6% ± 1.5% of the original 11.7% ± 1.0% accuracy drop remains unaccounted for.
arXiv Detail & Related papers (2020-05-19T17:48:32Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the question: have we reached a performance ceiling, or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the most challenging examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
- Adversarial Filters of Dataset Biases [96.090959788952]
Large neural models have demonstrated human-level performance on language and vision benchmarks.
Their performance degrades considerably on adversarial or out-of-distribution samples.
We propose AFLite, which adversarially filters such dataset biases.
arXiv Detail & Related papers (2020-02-10T21:59:21Z)
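For the prediction-time batch normalization entry above, a minimal sketch of the recipe, assuming a PyTorch model (the helper name is hypothetical and this is not the paper's code): let the BatchNorm layers normalize each prediction batch with that batch's statistics instead of the stored running averages, while the rest of the network stays in eval mode.

```python
import torch
from torch import nn

def enable_prediction_time_bn(model: nn.Module) -> nn.Module:
    """Hypothetical helper: put only the BatchNorm layers in training mode so
    they normalize each incoming batch with that batch's statistics rather
    than the stored running averages; all other layers remain in eval mode."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.train()  # side effect: running statistics keep updating
    return model

# Toy usage on a small network fed a shifted input distribution:
model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 2))
model = enable_prediction_time_bn(model)
with torch.no_grad():
    logits = model(torch.randn(32, 8) + 3.0)  # batch statistics absorb the shift
```

The recipe leans on each prediction batch being reasonably large and drawn from the shifted distribution, which is consistent with the mixed results the entry reports for more natural kinds of dataset shift.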