Does Recommend-Revise Produce Reliable Annotations? An Analysis on
Missing Instances in DocRED
- URL: http://arxiv.org/abs/2204.07980v1
- Date: Sun, 17 Apr 2022 11:29:01 GMT
- Title: Does Recommend-Revise Produce Reliable Annotations? An Analysis on
Missing Instances in DocRED
- Authors: Quzhe Huang, Shibo Hao, Yuan Ye, Shengqi Zhu, Yansong Feng, Dongyan
Zhao
- Abstract summary: We show that the recommend-revise scheme results in false negative samples and an obvious bias towards popular entities and relations.
The relabeled dataset is released to serve as a more reliable test set of document RE models.
- Score: 60.39125850987604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DocRED is a widely used dataset for document-level relation extraction. In
the large-scale annotation, a \textit{recommend-revise} scheme is adopted to
reduce the workload. Within this scheme, annotators are provided with candidate
relation instances from distant supervision, and they then manually supplement
and remove relational facts based on the recommendations. However, when
comparing DocRED with a subset relabeled from scratch, we find that this scheme
results in a considerable amount of false negative samples and an obvious bias
towards popular entities and relations. Furthermore, we observe that the models
trained on DocRED have low recall on our relabeled dataset and inherit the same
bias in the training data. Through the analysis of annotators' behaviors, we
figure out the underlying reason for the problems above: the scheme actually
discourages annotators from supplementing adequate instances in the revision
phase. We appeal to future research to take into consideration the issues with
the recommend-revise scheme when designing new models and annotation schemes.
The relabeled dataset is released at
\url{https://github.com/AndrewZhe/Revisit-DocRED}, to serve as a more reliable
test set of document RE models.
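A minimal sketch of how the missing-instance problem above can be quantified: compare the original annotations with the subset relabeled from scratch and count the triples that the recommend-revise labels miss. The file names and the DocRED-style JSON schema used here (documents keyed by title, with a "labels" list carrying "h", "r", and "t" fields) are illustrative assumptions rather than the exact released format.

```python
import json

# A minimal sketch, assuming DocRED-style JSON: each document has a "title"
# and a "labels" list whose entries carry head/tail entity indices ("h", "t")
# and a relation id ("r"). File names are placeholders.
with open("docred_dev.json") as f:        # original recommend-revise annotations
    original = {doc["title"]: doc for doc in json.load(f)}
with open("relabeled_dev.json") as f:     # subset relabeled from scratch
    relabeled = json.load(f)


def triples(doc):
    """Return the set of (head, relation, tail) triples labeled in one document."""
    return {(lab["h"], lab["r"], lab["t"]) for lab in doc.get("labels", [])}


covered = missing = 0
for doc in relabeled:
    gold = triples(doc)                            # relabeled-from-scratch triples
    old = triples(original.get(doc["title"], {}))  # triples kept by recommend-revise
    covered += len(gold & old)
    missing += len(gold - old)                     # false negatives in the original

total = covered + missing
print(f"original labels cover {covered}/{total} relabeled triples "
      f"(recall {covered / total:.2%}); {missing} instances are missing")
```

The same comparison, applied to a model's predictions instead of the original labels, yields the kind of recall measurement on the relabeled set that the abstract reports.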
Related papers
- Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards [5.632231145349045]
This paper investigates the transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP.
Existing relation extraction benchmarks often suffer from insufficient documentation and lack crucial details.
While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well.
arXiv Detail & Related papers (2024-11-07T22:36:19Z) - Consistent Document-Level Relation Extraction via Counterfactuals [47.75615221596254]
It has been shown that document-level relation extraction models trained on real-world data suffer from factual biases.
We present CovEReD, a counterfactual data generation approach for document-level relation extraction.
We show that by generating document-level counterfactual data with CovEReD and training models on it, consistency is maintained.
arXiv Detail & Related papers (2024-07-09T09:21:55Z) - RaFe: Ranking Feedback Improves Query Rewriting for RAG [83.24385658573198]
We propose a framework for training query rewriting models free of annotations.
By leveraging a publicly available reranker, our framework provides feedback well aligned with the rewriting objectives.
arXiv Detail & Related papers (2024-05-23T11:00:19Z) - Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z) - Class-Adaptive Self-Training for Relation Extraction with Incompletely
Annotated Training Data [43.46328487543664]
Relation extraction (RE) aims to extract relations from sentences and documents.
Recent studies showed that many RE datasets are incompletely annotated.
This is known as the false negative problem, in which valid relations are falsely annotated as 'no_relation'.
arXiv Detail & Related papers (2023-06-16T09:01:45Z) - Revisiting DocRED -- Addressing the False Negative Problem in Relation
Extraction [39.78594332093083]
We re-annotate 4,053 documents in the DocRED dataset by adding the missed relation triples back to the original DocRED.
We conduct extensive experiments with state-of-the-art neural models on both datasets, and the experimental results show that the models trained and evaluated on our Re-DocRED achieve performance improvements of around 13 F1 points.
arXiv Detail & Related papers (2022-05-25T11:54:48Z) - Efficient Few-Shot Fine-Tuning for Opinion Summarization [83.76460801568092]
Abstractive summarization models are typically pre-trained on large amounts of generic texts, then fine-tuned on tens or hundreds of thousands of annotated samples.
We show that a few-shot method based on adapters can easily store in-domain knowledge.
We show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning by 2.0 and 1.3 ROUGE-L points on the Amazon and Yelp datasets.
arXiv Detail & Related papers (2022-05-04T16:38:37Z) - Document-Level Relation Extraction with Reconstruction [28.593318203728963]
We propose a novel encoder-classifier-reconstructor model for document-level relation extraction (DocRE).
The reconstructor reconstructs the ground-truth path dependencies from the graph representation, ensuring that the proposed DocRE model pays more attention to encoding entity pairs with relationships during training.
Experimental results on a large-scale DocRE dataset show that the proposed model can significantly improve the accuracy of relation extraction on a strong heterogeneous graph-based baseline.
arXiv Detail & Related papers (2020-12-21T14:29:31Z) - Evaluating Models' Local Decision Boundaries via Contrast Sets [119.38387782979474]
We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data.
We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets.
Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets.
arXiv Detail & Related papers (2020-04-06T14:47:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.