TACRED Revisited: A Thorough Evaluation of the TACRED Relation
Extraction Task
- URL: http://arxiv.org/abs/2004.14855v1
- Date: Thu, 30 Apr 2020 15:07:37 GMT
- Title: TACRED Revisited: A Thorough Evaluation of the TACRED Relation
Extraction Task
- Authors: Christoph Alt, Aleksandra Gabryszak, Leonhard Hennig
- Abstract summary: TACRED is one of the largest and most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
- Score: 80.38130122127882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: TACRED (Zhang et al., 2017) is one of the largest, most widely used
crowdsourced datasets in Relation Extraction (RE). But, even with recent
advances in unsupervised pre-training and knowledge enhanced neural RE, models
still show a high error rate. In this paper, we investigate the questions: Have
we reached a performance ceiling or is there still room for improvement? And
how do crowd annotations, dataset, and models contribute to this error rate? To
answer these questions, we first validate the most challenging 5K examples in
the development and test sets using trained annotators. We find that label
errors account for 8% absolute F1 test error, and that more than 50% of the
examples need to be relabeled. On the relabeled test set the average F1 score
of a large baseline model set improves from 62.1 to 70.1. After validation, we
analyze misclassifications on the challenging instances, categorize them into
linguistically motivated error groups, and verify the resulting error
hypotheses on three state-of-the-art RE models. We show that two groups of
ambiguous relations are responsible for most of the remaining errors and that
models may adopt shallow heuristics on the dataset when entities are not
masked.
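The last finding refers to the entity masking convention of Zhang et al. (2017), in which subject and object mentions are replaced with typed placeholder tokens so a model cannot latch onto the surface strings. A minimal sketch of that convention follows; the function and example sentence are illustrative, not taken from the paper.

```python
def mask_entities(tokens, subj_span, obj_span, subj_type, obj_type):
    """Replace subject and object mentions with typed placeholders.

    tokens     -- list of word tokens for one sentence
    subj_span  -- (start, end) token indices of the subject, end exclusive
    obj_span   -- (start, end) token indices of the object, end exclusive
    subj_type  -- entity type of the subject, e.g. "PERSON"
    obj_type   -- entity type of the object, e.g. "CITY"
    """
    masked = list(tokens)
    # Replace every token of each mention with a single typed placeholder,
    # following the SUBJ-<TYPE> / OBJ-<TYPE> scheme of Zhang et al. (2017).
    for i in range(*subj_span):
        masked[i] = f"SUBJ-{subj_type}"
    for i in range(*obj_span):
        masked[i] = f"OBJ-{obj_type}"
    return masked

tokens = ["Anna", "Schmidt", "was", "born", "in", "Berlin", "."]
print(mask_entities(tokens, (0, 2), (5, 6), "PERSON", "CITY"))
# ['SUBJ-PERSON', 'SUBJ-PERSON', 'was', 'born', 'in', 'OBJ-CITY', '.']
```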
Related papers
- Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE).
RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
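As an illustration of the error-injection idea (not RISE's actual perturbation rules), the sketch below builds a (chosen, rejected) preference pair by introducing a subtle off-by-one numeric error into a correct solution; the helper names are hypothetical.

```python
import random
import re

def inject_subtle_error(solution: str, rng: random.Random) -> str:
    """Perturb one numeric token to create a near-miss 'rejected' solution.

    Illustrative rule only: RISE's predefined error types are not specified here.
    """
    numbers = list(re.finditer(r"\d+", solution))
    if not numbers:
        return solution
    m = rng.choice(numbers)
    wrong = str(int(m.group()) + rng.choice([-1, 1]))  # off-by-one error
    return solution[:m.start()] + wrong + solution[m.end():]

def make_preference_pair(correct_solution: str, seed: int = 0):
    """Return a (chosen, rejected) pair for preference learning."""
    rng = random.Random(seed)
    return correct_solution, inject_subtle_error(correct_solution, rng)

chosen, rejected = make_preference_pair("12 + 7 = 19, so the answer is 19.")
print(chosen)
print(rejected)
```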
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Automated Classification of Model Errors on ImageNet [7.455546102930913]
We propose an automated error classification framework to study how modeling choices affect error distributions.
We use our framework to comprehensively evaluate the error distribution of over 900 models.
In particular, we observe that the portion of severe errors drops significantly with increasing top-1 accuracy, indicating that, although top-1 accuracy underreports a model's true performance, it remains a valuable performance metric.
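A toy version of severity-based error classification, assuming a hand-written class-to-superclass mapping rather than the paper's automated framework; the labels and hierarchy below are hypothetical stand-ins.

```python
# Hypothetical class hierarchy: a real setup would derive this from the
# ImageNet/WordNet hierarchy rather than a hand-written dict.
superclass = {"tabby_cat": "cat", "tiger_cat": "cat", "golden_retriever": "dog"}

def error_severity(true_label: str, predicted: str) -> str:
    if predicted == true_label:
        return "correct"
    # A mistake within the same superclass (cat vs. cat) counts as minor;
    # crossing superclasses (cat vs. dog) counts as severe.
    if superclass.get(predicted) == superclass.get(true_label):
        return "minor"
    return "severe"

print(error_severity("tabby_cat", "tiger_cat"))         # minor
print(error_severity("tabby_cat", "golden_retriever"))  # severe
```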
arXiv Detail & Related papers (2023-11-13T20:41:39Z)
- LARA: A Light and Anti-overfitting Retraining Approach for Unsupervised Time Series Anomaly Detection [49.52429991848581]
We propose a Light and Anti-overfitting Retraining Approach (LARA) for time series anomaly detection methods based on deep variational auto-encoders (VAEs).
This work makes three novel contributions: 1) the retraining process is formulated as a convex problem, so it converges quickly and prevents overfitting; 2) a ruminate block is designed that leverages historical data without the need to store it; and 3) it is proven mathematically that, when fine-tuning the latent vector and reconstructed data, linear formulations achieve the least adjustment error between the ground truths and the fine-tuned outputs.
arXiv Detail & Related papers (2023-10-09T12:36:16Z)
- Class-Adaptive Self-Training for Relation Extraction with Incompletely Annotated Training Data [43.46328487543664]
Relation extraction (RE) aims to extract relations from sentences and documents.
Recent studies showed that many RE datasets are incompletely annotated.
This is known as the false negative problem, in which valid relations are falsely annotated as 'no_relation'.
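A hedged sketch of the self-training idea for recovering such false negatives: 'no_relation' examples are promoted to pseudo-positives when the model's confidence clears a per-class threshold. The function names, thresholds, and toy model are placeholders, not the paper's actual components.

```python
from typing import Callable, Dict, List, Tuple

def relabel_false_negatives(
    examples: List[Tuple[str, str]],                 # (sentence, gold label)
    predict_proba: Callable[[str], Dict[str, float]],
    thresholds: Dict[str, float],                    # per-relation cutoffs
) -> List[Tuple[str, str]]:
    relabeled = []
    for sentence, gold in examples:
        if gold == "no_relation":
            probs = predict_proba(sentence)
            best = max(probs, key=probs.get)
            # Promote to a pseudo-positive only if the model is confident
            # enough for that specific relation class.
            if best != "no_relation" and probs[best] >= thresholds[best]:
                relabeled.append((sentence, best))
                continue
        relabeled.append((sentence, gold))
    return relabeled

# Toy usage with a fake model that always predicts per:city_of_birth.
fake = lambda s: {"per:city_of_birth": 0.9, "no_relation": 0.1}
data = [("Anna was born in Berlin.", "no_relation")]
print(relabel_false_negatives(data, fake, {"per:city_of_birth": 0.8}))
```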
arXiv Detail & Related papers (2023-06-16T09:01:45Z)
- Annotating and Detecting Fine-grained Factual Errors for Dialogue Summarization [34.85353544844499]
We present the first dataset with fine-grained factual error annotations named DIASUMFACT.
We define fine-grained factual error detection as a sentence-level multi-label classification problem.
We propose an unsupervised model ENDERANKER via candidate ranking using pretrained encoder-decoder models.
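To illustrate the sentence-level multi-label formulation (not ENDERANKER itself, which is an unsupervised candidate-ranking model), here is a minimal supervised baseline using scikit-learn; the sentences and error tags are hypothetical stand-ins.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

sentences = [
    "The meeting was moved to Friday.",          # entity error
    "Alice said the budget doubled last year.",  # predicate + circumstance
    "The team agreed on the final design.",      # no error
]
labels = [["EntE"], ["PredE", "CircE"], []]      # hypothetical error tags

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)                    # sentences x error-types matrix

# One binary classifier per error type over tf-idf features.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(sentences, y)
print(mlb.inverse_transform(clf.predict(["The budget doubled."])))
```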
arXiv Detail & Related papers (2023-05-26T00:18:33Z)
- Certifying Data-Bias Robustness in Linear Regression [12.00314910031517]
We present a technique for certifying whether linear regression models are pointwise-robust to label bias in a training dataset.
We show how to solve this problem exactly for individual test points, and provide an approximate but more scalable method.
We also unearth gaps in bias-robustness, such as high levels of non-robustness for certain bias assumptions on some datasets.
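A brute-force illustration of pointwise robustness to label bias for least-squares regression: the paper certifies this property exactly or approximately, whereas the sketch below simply retrains under every small label perturbation. Parameter names and the perturbation model are assumptions for illustration.

```python
from itertools import combinations
import numpy as np

def is_pointwise_robust(X, y, x_test, flip_budget=1, delta=0.5, tol=0.25):
    """True if no perturbation of <= flip_budget labels (by +/- delta)
    moves the least-squares prediction at x_test by more than tol."""
    base = x_test @ np.linalg.lstsq(X, y, rcond=None)[0]
    n = len(y)
    for k in range(1, flip_budget + 1):
        for idx in combinations(range(n), k):
            for signs in np.ndindex(*([2] * k)):
                y_pert = y.copy()
                for i, s in zip(idx, signs):
                    y_pert[i] += delta if s else -delta
                # Retrain on the biased labels and compare predictions.
                pred = x_test @ np.linalg.lstsq(X, y_pert, rcond=None)[0]
                if abs(pred - base) > tol:
                    return False
    return True

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
print(is_pointwise_robust(X, y, np.array([0.3, -0.1, 0.2])))
```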
arXiv Detail & Related papers (2022-06-07T20:47:07Z)
- DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem.
The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network.
To improve the quality of ranking, we propose using additional information from version control system annotations.
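A minimal sketch of the described bidirectional recurrent encoder with attention, written in PyTorch; the dimensions, vocabulary size, and assignee-scoring head are illustrative assumptions, and the paper's CNN variant and version-control features are omitted.

```python
import torch
import torch.nn as nn

class TraceEncoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=64, n_assignees=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)       # additive attention scores
        self.out = nn.Linear(2 * hidden, n_assignees)

    def forward(self, frame_ids):                  # (batch, seq_len)
        h, _ = self.rnn(self.embed(frame_ids))     # (batch, seq, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)     # attention over frames
        ctx = (w * h).sum(dim=1)                   # weighted context vector
        return self.out(ctx)                       # assignee ranking scores

model = TraceEncoder()
scores = model(torch.randint(0, 1000, (2, 12)))   # two traces, 12 frames each
print(scores.shape)                                # torch.Size([2, 50])
```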
arXiv Detail & Related papers (2022-01-14T00:16:57Z)
- Re-TACRED: Addressing Shortcomings of the TACRED Dataset [5.820381428297218]
TACRED is one of the largest and most widely used sentence-level relation extraction datasets.
Proposed models that are evaluated using this dataset consistently set new state-of-the-art performance.
However, they still exhibit large error rates despite leveraging external knowledge and unsupervised pretraining on large text corpora.
arXiv Detail & Related papers (2021-04-16T22:55:11Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)