Coherence-Based Distributed Document Representation Learning for
Scientific Documents
- URL: http://arxiv.org/abs/2201.02846v1
- Date: Sat, 8 Jan 2022 15:29:21 GMT
- Title: Coherence-Based Distributed Document Representation Learning for
Scientific Documents
- Authors: Shicheng Tan, Shu Zhao, Yanping Zhang
- Abstract summary: We propose a coupled text pair embedding (CTPE) model to learn the representation of scientific documents.
We use negative sampling to construct uncoupled text pairs whose two parts are from different documents.
We train the model to judge whether the text pair is coupled or uncoupled and use the obtained embedding of coupled text pairs as the embedding of documents.
- Score: 9.646001537050925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed document representation is one of the basic problems in natural
language processing. Currently distributed document representation methods
mainly consider the context information of words or sentences. These methods do
not take into account the coherence of the document as a whole, e.g., a
relation between the paper title and abstract, headline and description, or
adjacent bodies in the document. The coherence shows whether a document is
meaningful, both logically and syntactically, especially in scientific
documents (papers or patents, etc.). In this paper, we propose a coupled text
pair embedding (CTPE) model to learn the representation of scientific
documents, which maintains the coherence of the document with coupled text
pairs formed by segmenting the document. First, we divide the document into two
parts (e.g., title and abstract, etc) which construct a coupled text pair.
Then, we adopt negative sampling to construct uncoupled text pairs whose two
parts are from different documents. Finally, we train the model to judge
whether the text pair is coupled or uncoupled and use the obtained embedding of
coupled text pairs as the embedding of documents. We perform experiments on
three datasets for one information retrieval task and two recommendation tasks.
The experimental results verify the effectiveness of the proposed CTPE model.
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z) - Knowledge-Driven Cross-Document Relation Extraction [3.868708275322908]
Relation extraction (RE) is a well-known NLP application often treated as a sentence- or document-level task.
We propose a novel approach, KXDocRE, that embed domain knowledge of entities with input text for cross-document RE.
arXiv Detail & Related papers (2024-05-22T11:30:59Z) - PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and
Entailment Recognition [63.51569687229681]
We argue for the need to recognize the textual entailment relation of each proposition in a sentence individually.
We propose PropSegmEnt, a corpus of over 45K propositions annotated by expert human raters.
Our dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document.
arXiv Detail & Related papers (2022-12-21T04:03:33Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - Multi-Vector Models with Textual Guidance for Fine-Grained Scientific
Document Similarity [11.157086694203201]
We present a new scientific document similarity model based on matching fine-grained aspects.
Our model is trained using co-citation contexts that describe related paper aspects as a novel form of textual supervision.
arXiv Detail & Related papers (2021-11-16T11:12:30Z) - Eider: Evidence-enhanced Document-level Relation Extraction [56.71004595444816]
Document-level relation extraction (DocRE) aims at extracting semantic relations among entity pairs in a document.
We propose a three-stage evidence-enhanced DocRE framework consisting of joint relation and evidence extraction, evidence-centered relation extraction (RE), and fusion of extraction results.
arXiv Detail & Related papers (2021-06-16T09:43:16Z) - Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z) - Pairwise Multi-Class Document Classification for Semantic Relations
between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task.
To find semantic relation between documents, we apply a series of techniques, such as GloVe, paragraph-s, BERT, and XLNet.
We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
arXiv Detail & Related papers (2020-03-22T12:52:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.