Manual Evaluation Matters: Reviewing Test Protocols of Distantly
Supervised Relation Extraction
- URL: http://arxiv.org/abs/2105.09543v1
- Date: Thu, 20 May 2021 06:55:40 GMT
- Authors: Tianyu Gao, Xu Han, Keyue Qiu, Yuzhuo Bai, Zhiyu Xie, Yankai Lin,
Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou
- Abstract summary: We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that manual evaluation can lead to very different conclusions from automatic evaluation.
- Score: 61.48964753725744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distantly supervised (DS) relation extraction (RE) has attracted much
attention in the past few years as it can utilize large-scale auto-labeled
data. However, its evaluation has long been a problem: previous works either
took costly and inconsistent approaches to manually examine a small sample of
model predictions, or directly tested models on auto-labeled data, which, by
our check, contains as many as 53% wrong labels at the entity pair level in the
popular NYT10 dataset. This problem has not only led to inaccurate evaluation,
but also made it hard to understand where we stand and what remains to be
improved in DS-RE research. To evaluate DS-RE models in a more credible way, we
build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20,
and thoroughly evaluate several competitive models, especially the latest
pre-trained ones. The experimental results show that manual evaluation can
lead to conclusions very different from those of automatic evaluation and
surfaces some unexpected observations; for example, pre-trained models
achieve dominant performance while being more susceptible to false positives
than previous methods. We hope that both our manual test sets and novel observations
can help advance future DS-RE research.
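To make the evaluation issue concrete: in held-out (automatic) evaluation, entity-pair predictions are scored against the distantly supervised labels themselves, so noise in the test labels directly distorts precision, whereas manual evaluation scores the same predictions against human annotations. The following is a minimal sketch of that comparison in Python; the toy facts, the auto_labels/manual_labels sets, and the precision-at-K metric are illustrative assumptions, not the paper's released code or exact protocol.

# Minimal sketch: the same ranked predictions scored against noisy
# DS (auto) labels vs. human-checked (manual) labels. Toy data only.

def precision_at_k(ranked_facts, gold, k):
    """Fraction of the top-k predicted (entity_pair, relation) facts
    that appear in the gold label set."""
    return sum(1 for fact in ranked_facts[:k] if fact in gold) / k

# Model predictions: (entity_pair, relation) with confidence scores.
predictions = [
    (("Obama", "Hawaii"), "born_in", 0.97),
    (("Google", "Mountain View"), "headquartered_in", 0.93),
    (("Obama", "Hawaii"), "employed_by", 0.88),
    (("Paris", "France"), "located_in", 0.80),
]
ranked = [(pair, rel) for pair, rel, _ in
          sorted(predictions, key=lambda p: -p[2])]

# Auto (DS) labels: noisy, as produced by aligning text with a KB;
# one true fact is missing and one spurious fact is included.
auto_labels = {
    (("Obama", "Hawaii"), "born_in"),
    (("Obama", "Hawaii"), "employed_by"),  # spurious DS label
}

# Manual labels: human-verified facts for the same test instances.
manual_labels = {
    (("Obama", "Hawaii"), "born_in"),
    (("Google", "Mountain View"), "headquartered_in"),
    (("Paris", "France"), "located_in"),
}

for k in (2, 4):
    print(f"P@{k}  auto: {precision_at_k(ranked, auto_labels, k):.2f}"
          f"  manual: {precision_at_k(ranked, manual_labels, k):.2f}")

On this toy data the noisy auto labels make the top-ranked predictions look only half correct (P@2 = 0.50) while the manual labels show them to be fully correct (P@2 = 1.00), and the false positive at rank 3 is rewarded by the auto labels; divergences of this kind are what allow automatic and manual evaluation to rank systems differently.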
Related papers
- Towards Robust Visual Question Answering: Making the Most of Biased
Samples via Contrastive Learning [54.61762276179205]
We propose a novel contrastive learning approach, MMBS, for building robust VQA models by Making the Most of Biased Samples.
Specifically, we construct positive samples for contrastive learning by eliminating the information related to spurious correlation from the original training samples.
We validate our contributions by achieving competitive performance on the OOD dataset VQA-CP v2 while preserving robust performance on the ID dataset VQA v2.
arXiv Detail & Related papers (2022-10-10T11:05:21Z)
- Confidence-Guided Data Augmentation for Deep Semi-Supervised Training [0.9968241071319184]
We propose a new data augmentation technique for semi-supervised learning settings that emphasizes learning from the most challenging regions of the feature space.
We perform experiments on two benchmark RGB datasets: CIFAR-100 and STL-10, and show that the proposed scheme improves classification performance in terms of accuracy and robustness.
arXiv Detail & Related papers (2022-09-16T21:23:19Z)
- A Principled Evaluation Protocol for Comparative Investigation of the
Effectiveness of DNN Classification Models on Similar-but-non-identical
Datasets [11.735794237408427]
We show that Deep Neural Network (DNN) models exhibit significant, consistent, and largely unexplained degradation in accuracy on replication test datasets.
We propose a principled evaluation protocol that is suitable for performing comparative investigations of the accuracy of a DNN model on multiple test datasets.
Our experimental results indicate that the observed accuracy degradation between established benchmark datasets and their replications is consistently lower than previously reported.
arXiv Detail & Related papers (2022-09-05T09:14:43Z)
- TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision [70.05605071885914]
We propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples.
We show the success of our method on the common benchmark dataset CIFAR10-C.
arXiv Detail & Related papers (2022-05-18T05:43:06Z)
- Efficient Test-Time Model Adaptation without Forgetting [60.36499845014649]
Test-time adaptation seeks to tackle potential distribution shifts between training and testing data.
We propose an active sample selection criterion to identify reliable and non-redundant samples.
We also introduce a Fisher regularizer to constrain important model parameters from drastic changes.
arXiv Detail & Related papers (2022-04-06T06:39:40Z)
- The MultiBERTs: BERT Reproductions for Robustness Analysis [86.29162676103385]
Re-running pretraining can lead to substantially different conclusions about performance.
We introduce MultiBERTs: a set of 25 BERT-base checkpoints.
The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures.
arXiv Detail & Related papers (2021-06-30T15:56:44Z)
- A Systematic Evaluation of Transfer Learning and Pseudo-labeling with
BERT-based Ranking Models [2.0498977512661267]
We evaluate transferability of BERT-based neural ranking models across five English datasets.
Each of our collections has a substantial number of queries, which enables a full-shot evaluation mode.
We find that training on pseudo-labels can produce a competitive or better model compared to transfer learning.
arXiv Detail & Related papers (2021-03-04T21:08:06Z)
- Towards Accurate and Consistent Evaluation: A Dataset for
Distantly-Supervised Relation Extraction [14.958043759503658]
We build a new dataset, NYT-H, where we use the DS-generated data as training data and hire annotators to label the test data.
Compared with previous datasets, NYT-H has a much larger test set, which enables more accurate and consistent evaluation.
The experimental results show that the ranking lists of the comparison systems on the DS-labelled test data and human-annotated test data are different.
arXiv Detail & Related papers (2020-10-30T13:52:52Z)
- Evaluating Models' Local Decision Boundaries via Contrast Sets [119.38387782979474]
We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data.
We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets.
Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets.
arXiv Detail & Related papers (2020-04-06T14:47:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.