Document Network Projection in Pretrained Word Embedding Space
- URL: http://arxiv.org/abs/2001.05727v1
- Date: Thu, 16 Jan 2020 10:16:37 GMT
- Title: Document Network Projection in Pretrained Word Embedding Space
- Authors: Antoine Gourru, Adrien Guille, Julien Velcin and Julien Jacques
- Abstract summary: We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents into a pretrained word embedding space.
We leverage a matrix of pairwise similarities providing complementary information (e.g., the network proximity of two documents in a citation graph).
The document representations can help to solve many information retrieval tasks, such as recommendation, classification and clustering.
- Score: 7.455546102930911
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Regularized Linear Embedding (RLE), a novel method that projects a
collection of linked documents (e.g. citation network) into a pretrained word
embedding space. In addition to the textual content, we leverage a matrix of
pairwise similarities providing complementary information (e.g., the network
proximity of two documents in a citation graph). We first build a simple word
vector average for each document, and we use the similarities to alter this
average representation. The document representations can help to solve many
information retrieval tasks, such as recommendation, classification and
clustering. We demonstrate that our approach outperforms or matches existing
document network embedding methods on node classification and link prediction
tasks. Furthermore, we show that it helps identify relevant keywords to
describe document classes.
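The abstract describes a two-step procedure: build a simple word-vector average for each document, then use the pairwise similarity matrix to alter that average. A minimal sketch of this idea is shown below; the mixing weight `lam` and the row-normalization of the similarity matrix are illustrative assumptions, not the paper's exact regularization scheme.

```python
import numpy as np

def rle_sketch(doc_term, word_emb, sim, lam=0.5):
    """Simplified sketch of the two-step idea behind RLE.

    doc_term : (n_docs, n_words) term-frequency matrix
    word_emb : (n_words, dim) pretrained word embeddings
    sim      : (n_docs, n_docs) pairwise similarities (e.g. citation proximity)
    lam      : hypothetical mixing weight between text and network signal
    """
    # Step 1: simple word-vector average per document.
    tf = doc_term / np.maximum(doc_term.sum(axis=1, keepdims=True), 1e-12)
    avg = tf @ word_emb                                  # (n_docs, dim)

    # Step 2: alter the averages with the similarities -- each document
    # is pulled toward the similarity-weighted mean of its neighbours.
    s = sim / np.maximum(sim.sum(axis=1, keepdims=True), 1e-12)
    return (1 - lam) * avg + lam * (s @ avg)
```

The resulting vectors live in the same space as the pretrained word embeddings, which is what makes keyword identification by nearest-word lookup possible.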
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Directed Criteria Citation Recommendation and Ranking Through Link Prediction [0.32885740436059047]
Our model uses transformer-based graph embeddings to encode the meaning of each document, presented as a node within a citation network.
We show that the semantic representations that our model generates can outperform other content-based methods in recommendation and ranking tasks.
arXiv Detail & Related papers (2024-03-18T20:47:38Z)
- Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics [0.6767885381740952]
We introduce a method that learns jointly embedded document and word vectors solely from the unlabeled document dataset.
The proposed method requires almost no text preprocessing but is simultaneously effective at retrieving relevant documents with high probability.
For easy replication of our approach, we make the developed Lbl2Vec code publicly available as a ready-to-use tool under the 3-Clause BSD license.
arXiv Detail & Related papers (2022-10-12T08:57:01Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z)
- Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task.
To find semantic relations between documents, we apply a series of techniques, such as GloVe, paragraph vectors, BERT, and XLNet.
We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
arXiv Detail & Related papers (2020-03-22T12:52:56Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
- Inductive Document Network Embedding with Topic-Word Attention [5.8010446129208155]
Document network embedding aims at learning representations for a structured text corpus when documents are linked to each other.
Recent algorithms extend network embedding approaches by incorporating the text content associated with the nodes in their formulations.
In this paper, we propose an interpretable and inductive document network embedding method.
arXiv Detail & Related papers (2020-01-10T10:14:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all of it) and is not responsible for any consequences of its use.