Coherence-Based Distributed Document Representation Learning for
Scientific Documents
- URL: http://arxiv.org/abs/2201.02846v1
- Date: Sat, 8 Jan 2022 15:29:21 GMT
- Authors: Shicheng Tan, Shu Zhao, Yanping Zhang
- Abstract summary: We propose a coupled text pair embedding (CTPE) model to learn the representation of scientific documents.
We use negative sampling to construct uncoupled text pairs whose two parts are from different documents.
We train the model to judge whether the text pair is coupled or uncoupled and use the obtained embedding of coupled text pairs as the embedding of documents.
- Score: 9.646001537050925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed document representation is one of the basic problems in natural
language processing. Current distributed document representation methods
mainly consider the context information of words or sentences. These methods do
not take into account the coherence of the document as a whole, e.g., a
relation between the paper title and abstract, headline and description, or
adjacent bodies in the document. The coherence shows whether a document is
meaningful, both logically and syntactically, especially in scientific
documents (papers, patents, etc.). In this paper, we propose a coupled text
pair embedding (CTPE) model to learn the representation of scientific
documents, which maintains the coherence of the document with coupled text
pairs formed by segmenting the document. First, we divide the document into two
parts (e.g., title and abstract), which form a coupled text pair.
Then, we adopt negative sampling to construct uncoupled text pairs whose two
parts are from different documents. Finally, we train the model to judge
whether the text pair is coupled or uncoupled and use the obtained embedding of
coupled text pairs as the embedding of documents. We perform experiments on
three datasets for one information retrieval task and two recommendation tasks.
The experimental results verify the effectiveness of the proposed CTPE model.
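
The pair-construction steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `build_text_pairs`, the `num_negatives` parameter, and the choice of title and abstract as the two parts are assumptions made for the example. The classifier that judges coupled versus uncoupled pairs is not specified in the abstract and is left out.

```python
import random


def build_text_pairs(docs, num_negatives=1, seed=0):
    """Build labeled text pairs from (title, abstract) documents.

    Hypothetical helper for illustration only.
    Label 1 = coupled pair (both parts from the same document).
    Label 0 = uncoupled pair (parts from two different documents).
    """
    if len(docs) < 2:
        raise ValueError("need at least two documents to sample uncoupled pairs")
    rng = random.Random(seed)
    pairs = []
    for i, (title, abstract) in enumerate(docs):
        # Coupled pair: the document split into its two parts.
        pairs.append((title, abstract, 1))
        # Negative sampling: pair this title with abstracts of other documents.
        other_indices = [j for j in range(len(docs)) if j != i]
        for _ in range(num_negatives):
            j = rng.choice(other_indices)
            pairs.append((title, docs[j][1], 0))
    return pairs


if __name__ == "__main__":
    # Toy documents, invented for the example.
    docs = [
        ("Title A", "Abstract A"),
        ("Title B", "Abstract B"),
        ("Title C", "Abstract C"),
    ]
    for first, second, label in build_text_pairs(docs, num_negatives=2):
        print(label, first, "|", second)
```

A model trained on such pairs would receive both parts as input and predict the label; according to the abstract, the embedding obtained for a coupled pair then serves as the document embedding.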
Related papers
- Knowledge-Driven Cross-Document Relation Extraction [3.868708275322908]
Relation extraction (RE) is a well-known NLP application often treated as a sentence- or document-level task.
We propose a novel approach, KXDocRE, that embeds domain knowledge of entities with input text for cross-document RE.
arXiv Detail & Related papers (2024-05-22T11:30:59Z)
- PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and
Entailment Recognition [63.51569687229681]
We argue for the need to recognize the textual entailment relation of each proposition in a sentence individually.
We propose PropSegmEnt, a corpus of over 45K propositions annotated by expert human raters.
Our dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document.
arXiv Detail & Related papers (2022-12-21T04:03:33Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Specialized Document Embeddings for Aspect-based Similarity of Research
Papers [4.661692753666685]
We treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces.
We represent a document not as a single generic embedding but as multiple specialized embeddings.
Our approach mitigates potential risks arising from implicit biases by making them explicit.
arXiv Detail & Related papers (2022-03-28T07:35:26Z)
- Bilingual Topic Models for Comparable Corpora [9.509416095106491]
We propose a binding mechanism between the distributions of the paired documents.
To estimate the similarity of documents that are written in different languages we use cross-lingual word embeddings that are learned with shallow neural networks.
We evaluate the proposed binding mechanism by extending two topic models: a bilingual adaptation of LDA that assumes bag-of-words inputs and a model that incorporates part of the text structure in the form of boundaries of semantically coherent segments.
arXiv Detail & Related papers (2021-11-30T10:53:41Z)
- Multi-Vector Models with Textual Guidance for Fine-Grained Scientific
Document Similarity [11.157086694203201]
We present a new scientific document similarity model based on matching fine-grained aspects.
Our model is trained using co-citation contexts that describe related paper aspects as a novel form of textual supervision.
arXiv Detail & Related papers (2021-11-16T11:12:30Z)
- Eider: Evidence-enhanced Document-level Relation Extraction [56.71004595444816]
Document-level relation extraction (DocRE) aims at extracting semantic relations among entity pairs in a document.
We propose a three-stage evidence-enhanced DocRE framework consisting of joint relation and evidence extraction, evidence-centered relation extraction (RE), and fusion of extraction results.
arXiv Detail & Related papers (2021-06-16T09:43:16Z)
- Pairwise Representation Learning for Event Coreference [73.10563168692667]
We develop a Pairwise Representation Learning (PairwiseRL) scheme for the event mention pairs.
Our representation supports a finer, structured representation of the text snippet to facilitate encoding events and their arguments.
We show that PairwiseRL, despite its simplicity, outperforms the prior state-of-the-art event coreference systems on both cross-document and within-document event coreference benchmarks.
arXiv Detail & Related papers (2020-10-24T06:55:52Z)
- Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z)
- Pairwise Multi-Class Document Classification for Semantic Relations
between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task.
To find semantic relations between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet.
We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
arXiv Detail & Related papers (2020-03-22T12:52:56Z)
- Explaining Relationships Between Scientific Documents [55.23390424044378]
We address the task of explaining relationships between two scientific documents using natural language text.
In this paper we establish a dataset of 622K examples from 154K documents.
arXiv Detail & Related papers (2020-02-02T03:54:47Z)