CD2CR: Co-reference Resolution Across Documents and Domains
- URL: http://arxiv.org/abs/2101.12637v1
- Date: Fri, 29 Jan 2021 15:18:30 GMT
- Title: CD2CR: Co-reference Resolution Across Documents and Domains
- Authors: James Ravenscroft, Arie Cattan, Amanda Clare, Ido Dagan, Maria Liakata
- Abstract summary: Cross-document co-reference resolution (CDCR) is the task of identifying and linking mentions to entities and concepts across many text documents.
We propose a new task and English language dataset for cross-document cross-domain co-reference resolution (CD$^2$CR).
We show that in this cross-domain, cross-document setting, existing CDCR models do not perform well and we provide a baseline model that outperforms current state-of-the-art CDCR models on CD$^2$CR.
- Score: 20.30046972135548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-document co-reference resolution (CDCR) is the task of identifying and
linking mentions to entities and concepts across many text documents. Current
state-of-the-art models for this task assume that all documents are of the same
type (e.g. news articles) or fall under the same theme. However, it is also
desirable to perform CDCR across different domains (type or theme). A
particular use case we focus on in this paper is the resolution of entities
mentioned across scientific work and newspaper articles that discuss them.
Identifying the same entities and corresponding concepts in both scientific
articles and news can help scientists understand how their work is represented
in mainstream media. We propose a new task and English language dataset for
cross-document cross-domain co-reference resolution (CD$^2$CR). The task aims
to identify links between entities across heterogeneous document types. We show
that in this cross-domain, cross-document setting, existing CDCR models do not
perform well and we provide a baseline model that outperforms current
state-of-the-art CDCR models on CD$^2$CR. Our data set, annotation tool and
guidelines as well as our model for cross-document cross-domain co-reference
are all supplied as open access open source resources.
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Knowledge-Driven Cross-Document Relation Extraction [3.868708275322908]
Relation extraction (RE) is a well-known NLP application often treated as a sentence- or document-level task.
We propose a novel approach, KXDocRE, that embeds domain knowledge of entities with input text for cross-document RE.
arXiv Detail & Related papers (2024-05-22T11:30:59Z)
- Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction [61.998789448260005]
We propose to identify the typical structure of a document within a collection.
We abstract over arbitrary header paraphrases, and ground each topic to respective document locations.
We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
arXiv Detail & Related papers (2024-02-21T16:22:21Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- Unifying Vision, Text, and Layout for Universal Document Processing [105.36490575974028]
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z)
- Entity-centered Cross-document Relation Extraction [34.38369224008656]
Relation Extraction (RE) is a fundamental task of information extraction, which has attracted a large amount of research attention.
Previous studies focus on extracting relations within a sentence or document, while researchers have recently begun to explore cross-document RE.
In this paper, we aim to address both of these shortages and push the state-of-the-art for cross-document RE.
arXiv Detail & Related papers (2022-10-29T09:27:15Z)
- Cross-document Event Coreference Search: Task, Dataset and Modeling [26.36068336169796]
We propose an appealing, and often more applicable, complementary setup for the task: Cross-document Coreference Search.
To support research on this task, we create a corresponding dataset, which is derived from Wikipedia.
We present a novel model that integrates a powerful coreference scoring scheme into the DPR architecture, yielding improved performance.
arXiv Detail & Related papers (2022-10-23T08:21:25Z)
- RDU: A Region-based Approach to Form-style Document Understanding [69.29541701576858]
Key Information Extraction (KIE) is aimed at extracting structured information from form-style documents.
We develop a new KIE model named Region-based Document Understanding (RDU).
RDU takes as input the text content and corresponding coordinates of a document, and tries to predict the result by localizing a bounding-box-like region.
arXiv Detail & Related papers (2022-06-14T14:47:48Z)
- SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts [28.96683772139377]
We present a new task of hierarchical CDCR for concepts in scientific papers.
The goal is to jointly infer coreference clusters and the hierarchy between them.
We create SciCo, an expert-annotated dataset for this task, which is 3X larger than the prominent ECB+ resource.
arXiv Detail & Related papers (2021-04-18T10:42:20Z)
- WEC: Deriving a Large-scale Cross-document Event Coreference dataset from Wikipedia [14.324743524196874]
We present Wikipedia Event Coreference (WEC), an efficient methodology for gathering a large-scale dataset for cross-document event coreference from Wikipedia.
We apply this methodology to the English Wikipedia and extract our large-scale WEC-Eng dataset.
We develop an algorithm that adapts components of state-of-the-art models for within-document coreference resolution to the cross-document setting.
arXiv Detail & Related papers (2021-04-11T14:54:35Z)
- Cross-Domain Document Object Detection: Benchmark Suite and Method [71.4339949510586]
Document object detection (DOD) is fundamental for downstream tasks like intelligent document editing and understanding.
We investigate cross-domain DOD, where the goal is to learn a detector for the target domain using labeled data from the source domain and only unlabeled data from the target domain.
For each dataset, we provide the page images, bounding box annotations, PDF files, and the rendering layers extracted from the PDF files.
arXiv Detail & Related papers (2020-03-30T03:04:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.