Anchor Prediction: A Topic Modeling Approach
- URL: http://arxiv.org/abs/2205.14631v2
- Date: Wed, 1 Jun 2022 07:38:36 GMT
- Title: Anchor Prediction: A Topic Modeling Approach
- Authors: Jean Dupuy, Adrien Guille and Julien Jacques
- Abstract summary: We propose an annotation task, which we refer to as anchor prediction.
Given a source document and a target document, this task consists of automatically identifying anchors in the source document.
We propose a contextualized relational topic model, CRTM, that models directed links between documents.
- Score: 2.0411082897313984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Networks of documents connected by hyperlinks, such as Wikipedia, are
ubiquitous. Hyperlinks are inserted by the authors to enrich the text and
facilitate navigation through the network. However, authors tend to insert
only a fraction of the relevant hyperlinks, mainly because this is a
time-consuming task. In this paper we address an annotation task, which we
refer to as anchor prediction. Even though it is conceptually close to link
prediction or entity linking, it is a different task that requires developing
a specific method to solve it. Given a source document and a target document,
this task consists of automatically identifying anchors in the source
document, i.e., words
or terms that should carry a hyperlink pointing towards the target document. We
propose a contextualized relational topic model, CRTM, that models directed
links between documents as a function of the local context of the anchor in the
source document and the whole content of the target document. The model can be
used to predict anchors in a source document, given the target document,
without relying on a dictionary of previously seen mentions or titles, nor on
any external knowledge graph. Authors can benefit from CRTM by letting it
automatically suggest hyperlinks, given a new document and the set of target
documents to connect to. It can also benefit readers, by dynamically
inserting hyperlinks between the documents they are reading. Experiments
conducted on several Wikipedia corpora (in English, Italian and German)
highlight the practical usefulness of anchor prediction and demonstrate the
relevancy of our approach.
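
The abstract above fully specifies the task interface: given a source document and a target document, rank the words of the source by how likely they are to anchor a link to the target, using the word's local context and the target's content. For intuition only, here is a minimal Python sketch of that interface. It scores each candidate anchor by the similarity between its context window and the target document, with an averaged-embedding encoder standing in for CRTM's topic-based scoring; all names (`encode`, `score_anchors`) and the scoring scheme are assumptions, not the authors' implementation.

```python
# Illustrative sketch of the anchor prediction setting -- NOT the CRTM model.
# Assumes `vectors` maps tokens to numpy arrays of dimension `dim`.
import numpy as np

def encode(text, vectors, dim=50):
    """Average the vectors of known tokens; a stand-in for a real encoder."""
    vecs = [vectors[t] for t in text.lower().split() if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def score_anchors(source_tokens, target_text, vectors, window=5):
    """Rank every source token as a candidate anchor for the target document."""
    target_vec = encode(target_text, vectors)
    ranked = []
    for i, token in enumerate(source_tokens):
        # Local context: a symmetric window of tokens around the candidate.
        context = " ".join(source_tokens[max(0, i - window): i + window + 1])
        context_vec = encode(context, vectors)
        denom = np.linalg.norm(context_vec) * np.linalg.norm(target_vec)
        score = float(context_vec @ target_vec) / denom if denom else 0.0
        ranked.append((token, score))
    # The highest-scoring tokens are the suggested anchor words.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```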
Related papers
- Directed Criteria Citation Recommendation and Ranking Through Link Prediction [0.32885740436059047]
Our model uses transformer-based graph embeddings to encode the meaning of each document, presented as a node within a citation network.
We show that the semantic representations that our model generates can outperform other content-based methods in recommendation and ranking tasks.
arXiv Detail & Related papers (2024-03-18T20:47:38Z) - FAMuS: Frames Across Multiple Sources [74.03795560933612]
FAMuS is a new corpus of Wikipedia passages that report on some event, paired with underlying, genre-diverse (non-Wikipedia) source articles for the same event.
We present results on two key event understanding tasks enabled by FAMuS.
arXiv Detail & Related papers (2023-11-09T18:57:39Z) - Anchor Prediction: Automatic Refinement of Internet Links [25.26235117917374]
We introduce the task of anchor prediction.
The goal is to identify the specific part of the linked target webpage that is most related to the source linking context.
We release the AuthorAnchors dataset, a collection of 34K naturally-occurring anchored links.
arXiv Detail & Related papers (2023-05-23T17:58:21Z) - Pre-training for Information Retrieval: Are Hyperlinks Fully Explored? [19.862211305690916]
We propose a progressive hyperlink prediction (PHP) framework to explore the utilization of hyperlinks in pre-training.
Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.
arXiv Detail & Related papers (2022-09-14T12:03:31Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - LinkBERT: Pretraining Language Models with Document Links [151.61148592954768]
Language model (LM) pretraining can learn various knowledge from text corpora, helping downstream tasks.
We propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks.
We show that LinkBERT outperforms BERT on various downstream tasks across two domains.
arXiv Detail & Related papers (2022-03-29T18:01:24Z) - SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z) - Predicting Links on Wikipedia with Anchor Text Information [0.571097144710995]
We study the transductive and the inductive tasks of link prediction on several subsets of the English Wikipedia.
We propose an appropriate evaluation sampling methodology and compare several algorithms.
arXiv Detail & Related papers (2021-05-25T07:57:57Z) - Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z) - Document Network Projection in Pretrained Word Embedding Space [7.455546102930911]
We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents into a pretrained word embedding space.
We leverage a matrix of pairwise similarities providing complementary information (e.g., the network proximity of two documents in a citation graph); a rough sketch of this idea follows the list below.
The document representations can help solve many information retrieval tasks, such as recommendation, classification and clustering.
arXiv Detail & Related papers (2020-01-16T10:16:37Z)
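
As promised above, here is a rough sketch of the idea behind that last entry, under stated assumptions: each document is embedded as a weighted average of pretrained word vectors, with term weights smoothed by a pairwise document-similarity matrix. The function name, the linear smoothing scheme, and the parameter `lam` are illustrative, not the paper's exact formulation.

```python
# Illustrative sketch of similarity-regularized document projection -- NOT
# the RLE implementation from the paper above.
import numpy as np

def rle_embed(doc_term, word_vectors, similarity, lam=0.5):
    """Project n documents into a pretrained word embedding space.

    doc_term:     (n_docs, n_words) row-normalized term-weight matrix
    word_vectors: (n_words, dim) pretrained word embeddings
    similarity:   (n_docs, n_docs) row-normalized pairwise similarities
                  (e.g., network proximity in a citation graph)
    lam:          mixing weight between a document's own terms and the
                  terms of its similar documents (assumed scheme)
    """
    # Blend each document's term profile with those of its similar
    # documents, then average the word vectors under the blended weights.
    blended = (1 - lam) * doc_term + lam * (similarity @ doc_term)
    return blended @ word_vectors  # (n_docs, dim)
```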
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.