Related papers: Multilevel Text Alignment with Cross-Document Attention

URL: http://arxiv.org/abs/2010.01263v1
Date: Sat, 3 Oct 2020 02:52:28 GMT
Title: Multilevel Text Alignment with Cross-Document Attention
Authors: Xuhui Zhou, Nikolaos Pappas, Noah A. Smith
Abstract summary: Existing alignment methods operate at a single, predefined level. We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
Score: 59.76351805607481
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text alignment finds application in tasks such as citation recommendation and plagiarism detection. Existing alignment methods operate at a single, predefined level and cannot learn to align texts at, for example, sentence and document levels. We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component, enabling structural comparisons across different levels (document-to-document and sentence-to-document). Our component is weakly supervised from document pairs and can align at multiple levels. Our evaluation on predicting document-to-document relationships and sentence-to-document relationships on the tasks of citation recommendation and plagiarism detection shows that our approach outperforms previously established hierarchical attention encoders based on recurrent and transformer contextualization that are unaware of structural correspondence between documents.
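
Illustration (not from the paper): the idea of cross-document attention can be sketched in a few lines. The sketch below is a minimal toy, assuming sentence vectors are already available as plain lists of floats, and using dot-product scoring with mean pooling as stand-ins. The function names (`cross_document_attention`, `mean_pool`) and the toy vectors are hypothetical and are not taken from the authors' implementation, which builds on learned hierarchical attention encoders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(vectors):
    """Average a list of vectors into one vector (assumed document encoder)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cross_document_attention(src_sentences, tgt_sentences):
    """For each source sentence, attend over all target sentences.

    Returns a list of (weights, context) pairs, where `weights` is a
    distribution over target sentences and `context` is their weighted sum.
    """
    dim = len(tgt_sentences[0])
    results = []
    for s in src_sentences:
        weights = softmax([dot(s, t) for t in tgt_sentences])
        context = [
            sum(w * t[i] for w, t in zip(weights, tgt_sentences))
            for i in range(dim)
        ]
        results.append((weights, context))
    return results


# Toy sentence vectors for two documents (made-up values).
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Sentence-level alignment: each sentence of doc_a against sentences of doc_b.
attended = cross_document_attention(doc_a, doc_b)

# Sentence-to-document comparison: each sentence of doc_a against pooled doc_b.
doc_b_vec = mean_pool(doc_b)
sent_to_doc = [dot(s, doc_b_vec) for s in doc_a]

# Document-to-document comparison: pooled doc_a against pooled doc_b.
doc_to_doc = dot(mean_pool(doc_a), doc_b_vec)
```

In this toy, the attention weights give a soft sentence-level alignment, while the pooled comparisons give the sentence-to-document and document-to-document views mentioned in the abstract. A trained model would learn the encoders and the scoring function from document pairs rather than use raw dot products.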

Related papers

Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings. First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss. Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
Specialized Document Embeddings for Aspect-based Similarity of Research Papers [4.661692753666685]
We treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces. We represent a document not as a single generic embedding but as multiple specialized embeddings. Our approach mitigates potential risks arising from implicit biases by making them explicit.
arXiv Detail & Related papers (2022-03-28T07:35:26Z)
Combining Deep Learning and Reasoning for Address Detection in Unstructured Text Documents [0.0]
We propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents. We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images.
arXiv Detail & Related papers (2022-02-07T12:32:00Z)
SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embeddings of scientific documents based on pretraining a Transformer language model. We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer. In detail, the input is a set of structured records and a reference text describing another record set. The output is a summary that accurately describes part of the content of the source record set in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
Document Network Projection in Pretrained Word Embedding Space [7.455546102930911]
We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents into a pretrained word embedding space. We leverage a matrix of pairwise similarities providing complementary information (e.g., the network proximity of two documents in a citation graph). The document representations can help to solve many information retrieval tasks, such as recommendation, classification and clustering.
arXiv Detail & Related papers (2020-01-16T10:16:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.