Longtonotes: OntoNotes with Longer Coreference Chains
- URL: http://arxiv.org/abs/2210.03650v1
- Date: Fri, 7 Oct 2022 15:58:41 GMT
- Title: Longtonotes: OntoNotes with Longer Coreference Chains
- Authors: Kumar Shridhar, Nicholas Monath, Raghuveer Thirukovalluru, Alessandro Stolfo, Manzil Zaheer, Andrew McCallum, Mrinmaya Sachan
- Abstract summary: We build a corpus of coreference-annotated documents of significantly longer length than what is currently available.
The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths.
We evaluate state-of-the-art neural coreference systems on this new corpus.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: OntoNotes has served as the most important benchmark for coreference
resolution. However, for ease of annotation, several long documents in
OntoNotes were split into smaller parts. In this work, we build a corpus of
coreference-annotated documents of significantly longer length than what is
currently available. We do so by providing an accurate, manually curated
merging of annotations from documents that were split into multiple parts in
the original OntoNotes annotation process. The resulting corpus, which we call
LongtoNotes, contains documents in multiple genres of the English language with
varying lengths, the longest of which are up to 8x the length of documents in
OntoNotes and 2x those in LitBank. We evaluate state-of-the-art neural
coreference systems on this new corpus, analyze how model architectures,
hyperparameters, and document length affect the performance and efficiency of
the models, and demonstrate areas of improvement in long-document coreference
modeling revealed by our new corpus. Our data and code are available at:
https://github.com/kumar-shridhar/LongtoNotes.
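To make the merging step concrete, here is a minimal sketch of how annotations from the split parts of a document can be recombined: token offsets in later parts are shifted by the lengths of the preceding parts, and clusters that denote the same entity in different parts are joined. This is an illustration only, not the authors' released pipeline; the `tokens`/`clusters` fields and the `alignment` mapping (which stands in for the manual curation described above) are assumed for the example.

```python
# Minimal sketch (not the LongtoNotes pipeline): recombine the parts of a split
# document and merge their coreference annotations. Each part is assumed to be
# a dict with "tokens" (list of str) and "clusters" (list of lists of
# [start, end] token spans). `alignment` maps (part_index, cluster_index) to a
# document-level entity id and stands in for the manual curation step.
from collections import defaultdict

def merge_parts(parts, alignment):
    tokens, offset = [], 0
    entity_to_spans = defaultdict(list)
    for p_idx, part in enumerate(parts):
        tokens.extend(part["tokens"])
        for c_idx, cluster in enumerate(part["clusters"]):
            entity = alignment[(p_idx, c_idx)]            # cross-part entity id
            for start, end in cluster:
                entity_to_spans[entity].append([start + offset, end + offset])
        offset += len(part["tokens"])                      # shift later parts
    clusters = [sorted(spans) for _, spans in sorted(entity_to_spans.items())]
    return {"tokens": tokens, "clusters": clusters}

# Example: cluster 0 of both parts refers to the same entity.
part_a = {"tokens": ["Obama", "spoke", "."], "clusters": [[[0, 0]]]}
part_b = {"tokens": ["He", "left", "."], "clusters": [[[0, 0]]]}
merged = merge_parts([part_a, part_b], {(0, 0): 0, (1, 0): 0})
print(merged["clusters"])  # [[[0, 0], [3, 3]]]
```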
Related papers
- DAPR: A Benchmark on Document-Aware Passage Retrieval
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- Leveraging BERT Language Model for Arabic Long Document Classification
We propose two models to classify long Arabic documents.
Both of our models outperform the Longformer and RoBERT in this task over two different datasets.
arXiv Detail & Related papers (2023-05-04T13:56:32Z)
- Fine-Grained Distillation for Long Document Retrieval
Long document retrieval aims to fetch query-relevant documents from a large-scale collection.
Knowledge distillation has become the de facto approach to improving a retriever by mimicking a heterogeneous yet powerful cross-encoder.
We propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers.
arXiv Detail & Related papers (2022-12-20T17:00:36Z)
- LawngNLI: A Long-Premise Benchmark for In-Domain Generalization from Short to Long Contexts and for Implication-Based Retrieval
LawngNLI is constructed from U.S. legal opinions, with automatic labels of high human-validated accuracy.
It can benchmark in-domain generalization from short to long contexts.
LawngNLI can train and test systems for implication-based case retrieval and argumentation.
arXiv Detail & Related papers (2022-12-06T18:42:39Z)
- Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of Anaphoric Reference for Fiction and Wikipedia Texts
This paper introduces a new release of a corpus for anaphoric reference labelled via a game-with-a-purpose.
It is comparable in size to the largest existing corpora for anaphoric reference due in part to substantial activity by the players.
The proposed method could be adopted to greatly speed up annotation time in other projects involving games-with-a-purpose.
arXiv Detail & Related papers (2022-10-11T16:13:57Z)
- Long Document Summarization with Top-down and Bottom-up Inference
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document, where the top level captures long-range dependencies.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- On Generating Extended Summaries of Long Documents
We present a new method for generating extended summaries of long papers.
Our method exploits the hierarchical structure of the documents and incorporates it into an extractive summarization model.
Our analysis shows that our multi-tasking approach can adjust the extraction probability distribution in favor of summary-worthy sentences.
arXiv Detail & Related papers (2020-12-28T08:10:28Z)
- Learning to Ignore: Long Document Coreference with Bounded Memory Neural Networks
We argue that keeping all entities in memory is unnecessary, and we propose a memory-augmented neural network that tracks only a small bounded number of entities at a time.
We show that (a) the model remains competitive with models with high memory and computational requirements on OntoNotes and LitBank, and (b) the model learns an efficient memory management strategy, easily outperforming a rule-based strategy (a toy sketch of the bounded-memory idea follows this entry).
arXiv Detail & Related papers (2020-10-06T15:16:31Z)
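The bounded-memory idea above can be illustrated with a toy example. The sketch below is not the paper's neural architecture: it keeps at most `max_slots` entities, evicts the least recently mentioned one when the memory is full, and links mentions by exact string match purely to expose the memory-management logic (a real model would use learned mention and entity representations); all names here are hypothetical.

```python
# Toy bounded-memory entity tracking (illustration only, not the paper's model).
from collections import OrderedDict

def bounded_memory_chains(mentions, max_slots=2):
    memory = OrderedDict()   # entity key -> mention indices currently tracked
    evicted = []             # chains that were dropped when memory was full
    for i, mention in enumerate(mentions):
        key = mention.lower()
        if key in memory:
            memory[key].append(i)
            memory.move_to_end(key)                            # recently mentioned
        else:
            if len(memory) == max_slots:
                evicted.append(memory.popitem(last=False)[1])  # evict LRU entity
            memory[key] = [i]
    return list(memory.values()) + evicted

mentions = ["Obama", "Clinton", "Obama", "Merkel", "Clinton"]
print(bounded_memory_chains(mentions, max_slots=2))
# Clinton's chain is split ([1] and [4]) because it was evicted before its
# second mention -- the cost of keeping only a bounded number of entities.
```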
- Fine-Grained Relevance Annotations for Multi-Task Document Ranking and Question Answering
We present FiRA: a novel dataset of fine-grained relevance annotations.
We extend the ranked retrieval annotations of the Deep Learning track of TREC 2019 with passage and word level graded relevance annotations for all relevant documents.
As an example, we evaluate the recently introduced TKL document ranking model. We find that although TKL exhibits state-of-the-art retrieval results for long documents, it misses many relevant passages.
arXiv Detail & Related papers (2020-08-12T14:59:50Z)
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching
We propose the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching.
Our model contains several innovations to adapt self-attention models for longer text input.
We will open-source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to accelerate future research on long-form document matching.
arXiv Detail & Related papers (2020-04-26T07:04:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.