Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of
Anaphoric Reference for Fiction and Wikipedia Texts
- URL: http://arxiv.org/abs/2210.05581v1
- Date: Tue, 11 Oct 2022 16:13:57 GMT
- Title: Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of
Anaphoric Reference for Fiction and Wikipedia Texts
- Authors: Juntao Yu, Silviu Paun, Maris Camilleri, Paloma Carretero Garcia, Jon
Chamberlain, Udo Kruschwitz, Massimo Poesio
- Abstract summary: This paper introduces a new release of a corpus for anaphoric reference labelled via a game-with-a-purpose.
It is comparable in size to the largest existing corpora for anaphoric reference due in part to substantial activity by the players.
The proposed method could be adopted to greatly speed up annotation time in other projects involving games-with-a-purpose.
- Score: 16.42217979543271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although several datasets annotated for anaphoric reference/coreference
exist, even the largest such datasets have limitations in terms of size, range
of domains, coverage of anaphoric phenomena, and size of documents included.
Yet, the approaches proposed to scale up anaphoric annotation haven't so far
resulted in datasets overcoming these limitations. In this paper, we introduce
a new release of a corpus for anaphoric reference labelled via a
game-with-a-purpose. This new release is comparable in size to the largest
existing corpora for anaphoric reference due in part to substantial activity by
the players, in part thanks to the use of a new resolve-and-aggregate paradigm
to 'complete' markable annotations through the combination of an anaphoric
resolver and an aggregation method for anaphoric reference. The proposed method
could be adopted to greatly speed up annotation time in other projects
involving games-with-a-purpose. In addition, the corpus covers genres for which
no comparable size datasets exist (Fiction and Wikipedia); it covers singletons
and non-referring expressions; and it includes a substantial number of long
documents (> 2K in length).
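The resolve-and-aggregate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' aggregation method: plain majority voting is swapped in here as a stand-in, and the function name `resolve_and_aggregate`, the label strings, and the choice to count the resolver as one extra vote are all assumptions made for illustration.

```python
from collections import Counter


def resolve_and_aggregate(player_labels, resolver_label=None):
    """Pick one label for a single markable (hypothetical helper).

    player_labels: labels chosen by players, e.g. an antecedent id such
        as "m1", or "DN" for discourse-new (label names are assumptions).
    resolver_label: label proposed by an automatic resolver, counted as
        one extra vote so that markables with few or no player
        judgments still receive a label.
    """
    votes = Counter(player_labels)
    if resolver_label is not None:
        votes[resolver_label] += 1
    if not votes:
        return None  # no judgments at all for this markable
    # Highest vote count wins; ties are broken alphabetically so the
    # result is deterministic.
    label, _ = min(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return label


# The resolver breaks a tie between two players.
print(resolve_and_aggregate(["m1", "m2"], resolver_label="m2"))
# With no player judgments, the resolver's label completes the markable.
print(resolve_and_aggregate([], resolver_label="DN"))
```

A probabilistic aggregation model would weight annotators by estimated reliability instead of counting every vote equally; the sketch only shows how an automatic judgment can fill in markables that players have not yet fully annotated.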
Related papers
- Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents [31.434507306952458]
We propose KNN-former, which incorporates a new kind of bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities.
We also use combinatorial matching to address the one-to-one mapping property that exists in many documents.
Our method is highly efficient compared to existing approaches in terms of the number of trainable parameters.
arXiv Detail & Related papers (2024-05-08T10:10:38Z)
- REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking [11.374031643273941]
REXEL is a highly efficient and accurate model for the joint task of document-level cIE (DocIE).
It is on average 11 times faster than competitive existing approaches in a similar setting.
The combination of speed and accuracy makes REXEL an accurate cost-efficient system for extracting structured information at web-scale.
arXiv Detail & Related papers (2024-04-19T11:04:27Z)
- Entity Disambiguation with Entity Definitions [50.01142092276296]
Local models have recently attained astounding performances in Entity Disambiguation (ED).
Previous works limited their studies to using, as the textual representation of each candidate, only its Wikipedia title.
In this paper, we address this limitation and investigate to what extent more expressive textual representations can mitigate it.
We report a new state of the art on 2 out of 6 benchmarks we consider and strongly improve the generalization capability over unseen patterns.
arXiv Detail & Related papers (2022-10-11T17:46:28Z)
- LongtoNotes: OntoNotes with Longer Coreference Chains [111.73115731999793]
We build a corpus of coreference-annotated documents of significantly longer length than what is currently available.
The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths.
We evaluate state-of-the-art neural coreference systems on this new corpus.
arXiv Detail & Related papers (2022-10-07T15:58:41Z)
- Scoring Coreference Chains with Split-Antecedent Anaphors [23.843305521306227]
We propose a solution to the technical problem of generalizing existing metrics for identity anaphora so that they can also be used to score cases of split-antecedents.
This is the first such proposal in the literature on anaphora or coreference, and has been successfully used to score both split-antecedent plural references and discourse deixis.
arXiv Detail & Related papers (2022-05-24T19:07:36Z)
- Autoregressive Entity Retrieval [55.38027440347138]
Entities are at the center of how we represent and aggregate knowledge.
The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering.
We propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion.
arXiv Detail & Related papers (2020-10-02T10:13:31Z)
- Object Detection with a Unified Label Space from Multiple Datasets [94.33205773893151]
Given multiple datasets with different label spaces, the goal of this work is to train a single object detector predicting over the union of all the label spaces.
Consider an object category like faces that is annotated in one dataset, but is not annotated in another dataset.
Some categories, like face here, would thus be considered foreground in one dataset, but background in another.
We propose loss functions that carefully integrate partial but correct annotations with complementary but noisy pseudo labels.
arXiv Detail & Related papers (2020-08-15T00:51:27Z)
- Joint Multi-Dimensional Model for Global and Time-Series Annotations [48.159050222769494]
Crowdsourcing is a popular approach to collect annotations for unlabeled data instances.
It involves collecting a large number of annotations from several, often naive, untrained annotators for each data instance, which are then combined to estimate the ground truth.
Most annotation fusion schemes, however, ignore this aspect and model each dimension separately.
We propose a generative model for multi-dimensional annotation fusion, which models the dimensions jointly leading to more accurate ground truth estimates.
arXiv Detail & Related papers (2020-05-06T20:08:46Z)
- Active Learning for Coreference Resolution using Discrete Annotation [76.36423696634584]
We improve upon pairwise annotation for active learning in coreference resolution.
We ask annotators to identify mention antecedents if a presented mention pair is deemed not coreferent.
In experiments with existing benchmark coreference datasets, we show that the signal from this additional question leads to significant performance gains per human-annotation hour.
arXiv Detail & Related papers (2020-04-28T17:17:11Z)