Related papers: Coherence-Based Distributed Document Representation Learning for Scientific Documents

URL: http://arxiv.org/abs/2201.02846v1
Date: Sat, 8 Jan 2022 15:29:21 GMT
Title: Coherence-Based Distributed Document Representation Learning for Scientific Documents
Authors: Shicheng Tan, Shu Zhao, Yanping Zhang
Abstract summary: We propose a coupled text pair embedding (CTPE) model to learn the representation of scientific documents. We use negative sampling to construct uncoupled text pairs whose two parts are from different documents. We train the model to judge whether the text pair is coupled or uncoupled and use the obtained embedding of coupled text pairs as the embedding of documents.
Score: 9.646001537050925
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Distributed document representation is one of the basic problems in natural language processing. Current distributed document representation methods mainly consider the context information of words or sentences. These methods do not take into account the coherence of the document as a whole, e.g., the relation between a paper's title and abstract, a headline and its description, or adjacent bodies in the document. Coherence shows whether a document is meaningful, both logically and syntactically, especially in scientific documents (papers, patents, etc.). In this paper, we propose a coupled text pair embedding (CTPE) model to learn the representation of scientific documents, which maintains the coherence of the document with coupled text pairs formed by segmenting the document. First, we divide the document into two parts (e.g., title and abstract), which form a coupled text pair. Then, we adopt negative sampling to construct uncoupled text pairs whose two parts are from different documents. Finally, we train the model to judge whether a text pair is coupled or uncoupled and use the obtained embedding of coupled text pairs as the embedding of documents. We perform experiments on three datasets for one information retrieval task and two recommendation tasks. The experimental results verify the effectiveness of the proposed CTPE model.
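
The pair-construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_text_pairs`, the `negatives_per_doc` parameter, the (title, abstract) split, and uniform sampling of the second part from other documents are all assumptions made for illustration. The classifier that judges coupled versus uncoupled pairs is not shown.

```python
import random


def build_text_pairs(documents, negatives_per_doc=1, seed=0):
    """Build labeled text pairs from (title, abstract) tuples.

    Hypothetical helper, not from the paper. Returns a list of
    (first_part, second_part, label) with label 1 for coupled pairs
    (both parts from the same document) and 0 for uncoupled pairs
    (second part sampled from a different document).
    """
    if len(documents) < 2:
        raise ValueError("need at least two documents to sample uncoupled pairs")

    rng = random.Random(seed)
    pairs = []
    for i, (title, abstract) in enumerate(documents):
        # Coupled pair: both parts come from document i.
        pairs.append((title, abstract, 1))

        # Uncoupled pairs: assumed uniform negative sampling over the
        # other documents. Drawing from n - 1 slots and shifting past i
        # guarantees the sampled index is never i.
        for _ in range(negatives_per_doc):
            j = rng.randrange(len(documents) - 1)
            if j >= i:
                j += 1
            pairs.append((title, documents[j][1], 0))
    return pairs


if __name__ == "__main__":
    # Toy documents, invented for the example.
    docs = [
        ("Title A", "Abstract A"),
        ("Title B", "Abstract B"),
        ("Title C", "Abstract C"),
    ]
    for first, second, label in build_text_pairs(docs, negatives_per_doc=2):
        print(label, first, "|", second)
```

In this sketch each document contributes one coupled pair and `negatives_per_doc` uncoupled pairs; the ratio of negatives to positives is a free choice here, and the abstract does not state the value the authors used.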

Related papers

Subtopic-aware View Sampling and Temporal Aggregation for Long-form Document Matching [34.81690842091582]
Long-form document matching aims to judge the relevance between two documents. We introduce a new framework to model representative matching signals. Our learning framework is effective on several document-matching tasks, including news duplication and legal case retrieval.
arXiv Detail & Related papers (2024-12-10T15:06:48Z)
Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings. First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss. Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
Knowledge-Driven Cross-Document Relation Extraction [3.868708275322908]
Relation extraction (RE) is a well-known NLP application often treated as a sentence- or document-level task. We propose a novel approach, KXDocRE, that embeds domain knowledge of entities with input text for cross-document RE.
arXiv Detail & Related papers (2024-05-22T11:30:59Z)
PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition [63.51569687229681]
We argue for the need to recognize the textual entailment relation of each proposition in a sentence individually. We propose PropSegmEnt, a corpus of over 45K propositions annotated by expert human raters. Our dataset structure resembles the tasks of (1) segmenting sentences within a document into a set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically aligned document.
arXiv Detail & Related papers (2022-12-21T04:03:33Z)
Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity [11.157086694203201]
We present a new scientific document similarity model based on matching fine-grained aspects. Our model is trained using co-citation contexts that describe related paper aspects as a novel form of textual supervision.
arXiv Detail & Related papers (2021-11-16T11:12:30Z)
Eider: Evidence-enhanced Document-level Relation Extraction [56.71004595444816]
Document-level relation extraction (DocRE) aims at extracting semantic relations among entity pairs in a document. We propose a three-stage evidence-enhanced DocRE framework consisting of joint relation and evidence extraction, evidence-centered relation extraction (RE), and fusion of extraction results.
arXiv Detail & Related papers (2021-06-16T09:43:16Z)
Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level. We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z)
Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task. To find semantic relations between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
arXiv Detail & Related papers (2020-03-22T12:52:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.