Towards Accurate and Consistent Evaluation: A Dataset for
Distantly-Supervised Relation Extraction
- URL: http://arxiv.org/abs/2010.16275v1
- Date: Fri, 30 Oct 2020 13:52:52 GMT
- Title: Towards Accurate and Consistent Evaluation: A Dataset for
Distantly-Supervised Relation Extraction
- Authors: Tong Zhu, Haitao Wang, Junjie Yu, Xiabing Zhou, Wenliang Chen, Wei
Zhang, Min Zhang
- Abstract summary: We build a new dataset, NYT-H, in which DS-generated data is used as training data and annotators are hired to label the test data.
Compared with previous datasets, NYT-H has a much larger test set, enabling more accurate and consistent evaluation.
Experimental results show that the comparison systems rank differently on the DS-labelled test data and the human-annotated test data.
- Score: 14.958043759503658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, distantly-supervised relation extraction has achieved a
certain degree of success by using deep neural networks. Distant Supervision (DS) can
automatically generate large-scale annotated data by aligning entity pairs from
Knowledge Bases (KB) to sentences. However, these DS-generated datasets
inevitably contain wrong labels, which lead to incorrect evaluation scores during
testing and may mislead researchers. To solve this problem, we build a new
dataset, NYT-H, in which we use the DS-generated data as training data and hire
annotators to label the test data. Compared with previous datasets, NYT-H has a
much larger test set, so we can perform more accurate and consistent evaluation.
Finally, we present the experimental results of several widely used systems on
NYT-H. The results show that the ranking lists of the comparison systems on the
DS-labelled test data and the human-annotated test data are different. This
indicates that our human-annotated data is necessary for the evaluation of
distantly-supervised relation extraction.
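The distant-supervision labelling step described in the abstract (aligning KB entity pairs to sentences) can be illustrated with a minimal sketch. The KB triples and sentences below are hypothetical placeholders, not drawn from NYT-H itself.

```python
# Minimal sketch of distant-supervision labelling: any sentence that mentions
# both entities of a KB relation triple is labelled with that relation.
# The triples and sentences here are illustrative, not taken from NYT-H.
KB = {
    ("Barack Obama", "Honolulu"): "/people/person/place_of_birth",
    ("Apple", "Cupertino"): "/business/company/headquarters",
}

sentences = [
    "Barack Obama was born in Honolulu , Hawaii .",
    "Apple announced a new campus in Cupertino .",
    "Barack Obama visited Honolulu last week .",  # entities co-occur, relation absent
]

def distant_label(sentence, kb):
    """Return (head, tail, relation) for every KB pair whose entities both appear."""
    labels = []
    for (head, tail), relation in kb.items():
        if head in sentence and tail in sentence:
            labels.append((head, tail, relation))
    return labels

for s in sentences:
    print(s, "->", distant_label(s, KB))
```

The third sentence shows why DS labels can be wrong: both entities appear, so DS assigns the birth-place relation even though the sentence does not express it. Noise of this kind in test data is exactly what motivates the human-annotated test set in NYT-H.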
Related papers
- Conditional Semi-Supervised Data Augmentation for Spam Message Detection with Low Resource Data [0.0]
We propose conditional semi-supervised data augmentation for spam detection models that lack sufficient labelled data.
We exploit unlabeled data for augmentation to extend the training data.
Latent variables derived from both labeled and unlabeled data serve as the input to the final classifier.
arXiv Detail & Related papers (2024-07-06T07:51:24Z)
- Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction.
We show that GPT-3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
arXiv Detail & Related papers (2023-11-16T06:27:35Z)
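The CAT entry above hinges on a simple operation: swap one part of an input with the corresponding part from another example and check whether the prediction changes. A minimal, hypothetical sketch of that check follows; the model interface, field name, and data layout are assumptions, not the paper's actual implementation.

```python
import random

def counterfactual_attentiveness(model, examples, field="context"):
    """Fraction of examples whose prediction changes when `field` is replaced
    with the same field taken from a randomly chosen different example.
    An attentive model is expected to change its prediction often."""
    changed = 0
    for ex in examples:
        original_pred = model(ex)
        donor = random.choice([e for e in examples if e is not ex])
        counterfactual = {**ex, field: donor[field]}  # swap in the donor's field
        if model(counterfactual) != original_pred:
            changed += 1
    return changed / len(examples)
```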
- Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
- A Principled Evaluation Protocol for Comparative Investigation of the Effectiveness of DNN Classification Models on Similar-but-non-identical Datasets [11.735794237408427]
We show that Deep Neural Network (DNN) models exhibit significant, consistent, and largely unexplained degradation in accuracy on replication test datasets.
We propose a principled evaluation protocol that is suitable for performing comparative investigations of the accuracy of a DNN model on multiple test datasets.
Our experimental results indicate that the observed accuracy degradation between established benchmark datasets and their replications is consistently lower than the degradation reported in prior work.
arXiv Detail & Related papers (2022-09-05T09:14:43Z)
- Vector-Based Data Improves Left-Right Eye-Tracking Classifier Performance After a Covariate Distributional Shift [0.0]
We propose a fine-grain data approach for EEG-ET data collection in order to create more robust benchmarking.
We train machine learning models utilizing both coarse-grain and fine-grain data and compare their accuracies when tested on data of similar/different distributional patterns.
Results showed that models trained on fine-grain, vector-based data were less susceptible to distributional shifts than models trained on coarse-grain, binary-classified data.
arXiv Detail & Related papers (2022-07-31T16:27:50Z)
- Anomaly Detection with Test Time Augmentation and Consistency Evaluation [13.709281244889691]
We propose a simple yet effective anomaly detection algorithm named Test Time Augmentation Anomaly Detection (TTA-AD).
We observe that, on a trained network, in-distribution data receives more consistent predictions for its original and augmented versions than out-of-distribution data.
Experiments on various high-resolution image benchmark datasets demonstrate that TTA-AD achieves comparable or better detection performance.
arXiv Detail & Related papers (2022-06-06T04:27:06Z)
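The TTA-AD entry above scores an input by how consistently a trained classifier predicts across augmented views of it. A minimal sketch of one such consistency score is below; the `predict` callable and augmentation list are placeholders, and the actual TTA-AD scoring rule may differ.

```python
import numpy as np

def tta_consistency_score(predict, x, augmentations):
    """Agreement between the model's prediction on x and on augmented copies of x.
    `predict` maps an input to a vector of class probabilities; `augmentations`
    is a list of functions that each return a perturbed copy of x.
    Higher consistency suggests in-distribution; lower suggests an anomaly."""
    p_orig = np.asarray(predict(x))
    label_orig = p_orig.argmax()
    agreements = []
    for augment in augmentations:
        p_aug = np.asarray(predict(augment(x)))
        agreements.append(float(p_aug.argmax() == label_orig))
    return float(np.mean(agreements))
```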
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction [61.48964753725744]
We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that the manual evaluation can indicate very different conclusions from automatic ones.
arXiv Detail & Related papers (2021-05-20T06:55:40Z)
- HYDRA: Hypergradient Data Relevance Analysis for Interpreting Deep Neural Networks [51.143054943431665]
We propose Hypergradient Data Relevance Analysis, or HYDRA, which interprets predictions made by deep neural networks (DNNs) as effects of their training data.
HYDRA assesses the contribution of training data toward test data points throughout the training trajectory.
In addition, we quantitatively demonstrate that HYDRA outperforms influence functions in accurately estimating data contribution and detecting noisy data labels.
arXiv Detail & Related papers (2021-02-04T10:00:13Z)
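HYDRA's hypergradient computation is involved; as a rough stand-in for the general idea of scoring how much each training point influences a test prediction along the training trajectory, a TracIn-style gradient dot product can be sketched. Everything here (the `grad_loss` helper, checkpoints, learning rates) is a placeholder and not HYDRA's actual procedure.

```python
import numpy as np

def data_contribution(grad_loss, checkpoints, train_examples, test_example, lrs):
    """TracIn-style stand-in for trajectory-based data relevance: accumulate, over
    saved training checkpoints, the alignment between each training example's loss
    gradient and the test example's loss gradient. `grad_loss(params, example)`
    must return a flat numpy gradient vector; `checkpoints` and `lrs` are the
    saved parameters and learning rates along the training trajectory."""
    scores = np.zeros(len(train_examples))
    for params, lr in zip(checkpoints, lrs):
        g_test = grad_loss(params, test_example)
        for i, ex in enumerate(train_examples):
            scores[i] += lr * float(np.dot(grad_loss(params, ex), g_test))
    return scores  # higher score: the training point reduced the test loss more
```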
- The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets [58.53269361115974]
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing.
However, undesired patterns in the collected data can make such tests incorrect.
We introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data.
arXiv Detail & Related papers (2020-11-03T16:50:13Z)
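The last entry describes reweighting test samples so that confounding patterns in the collected data do not distort a bias measurement. A minimal, hypothetical sketch of one common way to do this (importance weights that bring an attribute's empirical distribution in line with a desired target distribution) is below; the method in that paper may differ.

```python
from collections import Counter

def balance_weights(test_attributes, target_dist):
    """Per-sample weights that reweight the test set so the empirical
    distribution of a confounding attribute matches `target_dist`.
    `test_attributes` lists the attribute value of each test sample;
    `target_dist` maps attribute value -> desired probability."""
    counts = Counter(test_attributes)
    n = len(test_attributes)
    return [target_dist[a] / (counts[a] / n) for a in test_attributes]

# Hypothetical usage: even out a 70/30 split so both groups count equally.
attrs = ["male"] * 7 + ["female"] * 3
weights = balance_weights(attrs, {"male": 0.5, "female": 0.5})
```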