Pre-training for Information Retrieval: Are Hyperlinks Fully Explored?
- URL: http://arxiv.org/abs/2209.06583v1
- Date: Wed, 14 Sep 2022 12:03:31 GMT
- Title: Pre-training for Information Retrieval: Are Hyperlinks Fully Explored?
- Authors: Jiawen Wu, Xinyu Zhang, Yutao Zhu, Zheng Liu, Zikai Guo, Zhaoye Fei,
Ruofei Lai, Yongkang Wu, Zhao Cao, Zhicheng Dou
- Abstract summary: We propose a progressive hyperlink prediction (PHP) framework to explore the utilization of hyperlinks in pre-training.
Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.
- Score: 19.862211305690916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed great progress on applying pre-trained language
models, e.g., BERT, to information retrieval (IR) tasks. Hyperlinks, which are
commonly used in Web pages, have been leveraged for designing pre-training
objectives. For example, anchor texts of the hyperlinks have been used for
simulating queries, thus constructing tremendous query-document pairs for
pre-training. However, as a bridge across two web pages, the potential of
hyperlinks has not been fully explored. In this work, we focus on modeling the
relationship between two documents that are connected by hyperlinks and
designing a new pre-training objective for ad-hoc retrieval. Specifically, we
categorize the relationships between documents into four groups: no link,
unidirectional link, symmetric link, and the most relevant symmetric link. By
comparing two documents sampled from adjacent groups, the model can gradually
improve its capability of capturing matching signals. We propose a progressive
hyperlink prediction (PHP) framework to explore the utilization of
hyperlinks in pre-training. Experimental results on two large-scale ad-hoc
retrieval datasets and six question-answering datasets demonstrate its
superiority over existing pre-training methods.
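To make the progressive comparison concrete, here is a minimal Python sketch of the adjacent-group pairwise objective described in the abstract. The group encoding, the hinge loss, and the margin are illustrative assumptions; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

# Assumed group encoding: 0 = no link, 1 = unidirectional link,
# 2 = symmetric link, 3 = most relevant symmetric link.
def progressive_pairwise_loss(scores_by_group, margin=1.0):
    """scores_by_group: four 1-D tensors of model scores s(anchor_doc, doc),
    one per relationship group. Documents from group g+1 should outscore
    documents from the adjacent, weaker group g."""
    loss = torch.zeros(())
    for g in range(len(scores_by_group) - 1):
        neg, pos = scores_by_group[g], scores_by_group[g + 1]
        # hinge over all (pos, neg) pairs drawn from the two adjacent groups
        loss = loss + F.relu(margin - pos.unsqueeze(1) + neg.unsqueeze(0)).mean()
    return loss

# usage with random stand-in scores (normally produced by the encoder)
scores = [torch.randn(8, requires_grad=True) for _ in range(4)]
progressive_pairwise_loss(scores).backward()
```

Stepping through adjacent groups lets the model learn coarse link/no-link signals before the finer symmetric-link distinctions, matching the "gradually improve" framing above.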
Related papers
- Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$2$).
GR$2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z)
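A hedged sketch of the multi-graded constrained contrastive idea in GR$2$: candidates with higher relevance grades are pushed to outscore lower-graded ones by a grade-dependent margin. The margin schedule and grade encoding are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_graded_contrastive_loss(scores, grades, base_margin=0.5):
    """scores: (n,) model scores for candidate identifiers of one query.
    grades: (n,) integer relevance grades (e.g., 0=irrelevant .. 3=perfect)."""
    diff = grades.unsqueeze(1) - grades.unsqueeze(0)    # grade gap per pair
    margin = base_margin * diff.clamp(min=0).float()    # larger gap, larger margin
    gap = scores.unsqueeze(1) - scores.unsqueeze(0)     # score difference per pair
    # penalize pairs where a higher-graded candidate fails to outscore a lower one
    return F.relu(margin - gap)[diff > 0].mean()

scores = torch.randn(6, requires_grad=True)
grades = torch.tensor([3, 2, 2, 1, 0, 0])
multi_graded_contrastive_loss(scores, grades).backward()
```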
- Bootstrapped Pre-training with Dynamic Identifier Prediction for Generative Retrieval [108.9772640854136]
Generative retrieval uses differentiable search indexes to directly generate relevant document identifiers in response to a query.
Recent studies have highlighted the potential of a strong generative retrieval model, trained with carefully crafted pre-training tasks, to enhance downstream retrieval tasks via fine-tuning.
We introduce BootRet, a bootstrapped pre-training method for generative retrieval that dynamically adjusts document identifiers during pre-training to accommodate the evolving corpus.
arXiv Detail & Related papers (2024-07-16T08:42:36Z)
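A minimal sketch of the dynamic-identifier idea in BootRet: identifiers are periodically re-derived from the current document embeddings so they stay consistent with the evolving model and corpus. The k-means scheme and the identifier format are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def reassign_identifiers(doc_embeddings, n_clusters=8):
    """Re-derive identifiers from the current encoder's document embeddings,
    so identifiers track the model as pre-training (and the corpus) evolves."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_embeddings)
    # identifier = cluster id plus within-cluster rank (a toy semantic id)
    ids, counters = {}, {}
    for doc_idx, c in enumerate(labels):
        counters[c] = counters.get(c, 0) + 1
        ids[doc_idx] = f"{c}-{counters[c]}"
    return ids

embeddings = np.random.randn(100, 32)   # stand-in for encoder outputs
print(reassign_identifiers(embeddings)[0])
```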
- Query-oriented Data Augmentation for Session Search [71.84678750612754]
We propose query-oriented data augmentation to enrich search logs and strengthen the modeling of search contexts.
We generate supplemental training pairs by altering the most important part of a search context.
We develop several strategies to alter the current query, resulting in new training data with varying degrees of difficulty.
arXiv Detail & Related papers (2024-07-04T08:08:33Z)
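A toy sketch of query-oriented augmentation: perturbing the current query in a session to generate supplemental training contexts of controllable difficulty. The token-dropout perturbation is an assumption; the paper develops several alteration strategies.

```python
import random

def augment_query(session, difficulty=0.3, seed=0):
    """session: list of past queries plus the current query (last element).
    Returns a new session whose current query has some terms dropped."""
    rng = random.Random(seed)
    *history, current = session
    tokens = current.split()
    kept = [t for t in tokens if rng.random() > difficulty] or tokens[:1]
    return history + [" ".join(kept)]

session = ["best hiking boots", "waterproof hiking boots for winter"]
for d in (0.2, 0.5, 0.8):   # harder variants drop more of the query
    print(augment_query(session, difficulty=d, seed=1))
```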
- Improving Topic Relevance Model by Mix-structured Summarization and LLM-based Data Augmentation [16.170841777591345]
In social search scenarios such as Dianping, modeling search relevance faces two challenges.
We first take the query combined with a query-based summary and a query-independent document summary as the input of the topic relevance model.
Then, we utilize the language understanding and generation abilities of a large language model (LLM) to rewrite and generate queries from the queries and documents in existing training data.
arXiv Detail & Related papers (2024-04-03T10:05:47Z)
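A hedged sketch of the mix-structured input: the query, a query-based document summary, and a query-independent summary are concatenated for the topic relevance model. The summarize helper is a hypothetical extractive stub standing in for the paper's summarization components.

```python
def build_relevance_input(query, document, sep="[SEP]"):
    query_based_summary = summarize(document, focus=query)   # hypothetical helper
    generic_summary = summarize(document, focus=None)        # hypothetical helper
    return f"{query} {sep} {query_based_summary} {sep} {generic_summary}"

def summarize(document, focus=None, n_sentences=2):
    # toy extractive stub: prefer sentences sharing terms with the focus query
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    if focus:
        terms = set(focus.lower().split())
        sentences.sort(key=lambda s: -len(terms & set(s.lower().split())))
    return ". ".join(sentences[:n_sentences])

doc = "Great noodles and friendly staff. Parking is hard. The broth is rich."
print(build_relevance_input("rich broth noodles", doc))
```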
- A Semantic Mention Graph Augmented Model for Document-Level Event Argument Extraction [12.286432133599355]
Document-level Event Argument Extraction (DEAE) aims to identify arguments and their specific roles from an unstructured document.
Advanced approaches to DEAE utilize prompt-based methods to guide pre-trained language models (PLMs) in extracting arguments from input documents.
In this paper, we propose a semantic mention Graph Augmented Model (GAM) to address these two problems.
arXiv Detail & Related papers (2024-03-12T08:58:07Z)
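A minimal sketch of one plausible mention-graph construction: mentions become nodes and intra-sentence co-occurrence creates edges. GAM's actual graph semantics are richer; this construction is an illustrative assumption.

```python
from collections import defaultdict

def build_mention_graph(sentences_with_mentions):
    """sentences_with_mentions: list of mention lists, one list per sentence."""
    edges = defaultdict(set)
    for mentions in sentences_with_mentions:
        for i, m in enumerate(mentions):
            for other in mentions[i + 1:]:   # link mentions co-occurring in a sentence
                edges[m].add(other)
                edges[other].add(m)
    return dict(edges)

doc = [["UN", "Geneva"], ["Geneva", "treaty"], ["treaty", "UN"]]
print(build_mention_graph(doc))
```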
- Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z)
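A toy sketch of building a cross-document QA pre-training instance: a span shared across two related documents is masked in one and recovered with the help of the other. The span-selection heuristic is an assumption, not the paper's procedure.

```python
def make_cross_doc_qa_example(doc_a, doc_b, mask_token="<mask>"):
    """Pick the longest word shared by both documents as the 'answer' span,
    mask it in doc_a, and keep doc_b as cross-document evidence."""
    shared = set(doc_a.split()) & set(doc_b.split())
    if not shared:
        return None
    answer = max(shared, key=len)
    question_context = doc_a.replace(answer, mask_token)
    return {"input": f"{question_context} </s> {doc_b}", "target": answer}

a = "The Rosetta probe reached comet 67P in 2014 after a decade in flight."
b = "ESA launched Rosetta in 2004; the probe arrived at its comet ten years later."
print(make_cross_doc_qa_example(a, b))
```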
- Bi-Link: Bridging Inductive Link Predictions from Text via Contrastive Learning of Transformers and Prompts [2.9972063833424216]
We propose Bi-Link, a contrastive learning framework with probabilistic syntax prompts for link predictions.
Using the grammatical knowledge of BERT, we efficiently search for relational prompts according to learnt syntactic patterns that generalize to large knowledge graphs.
In our experiments, Bi-Link outperforms recent baselines on link prediction datasets.
arXiv Detail & Related papers (2022-10-26T04:31:07Z)
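A hedged sketch of contrastive link prediction over text embeddings in the spirit of Bi-Link: the encoded head-plus-prompt is pulled toward the true tail and away from negatives. The encoder, prompt template, and InfoNCE-style loss are stand-in assumptions.

```python
import torch
import torch.nn.functional as F

def link_contrastive_loss(head_emb, tail_embs, temperature=0.05):
    """head_emb: (d,) embedding of 'head [relation prompt]'; tail_embs: (n, d)
    candidate tail embeddings, row 0 being the true tail."""
    sims = F.cosine_similarity(head_emb.unsqueeze(0), tail_embs) / temperature
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([0]))

d = 16
head = torch.randn(d, requires_grad=True)   # stand-in for a BERT-style encoder
tails = torch.randn(8, d)
link_contrastive_loss(head, tails).backward()
```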
- Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking [56.80065604034095]
We introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant.
To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario.
arXiv Detail & Related papers (2022-10-19T16:19:37Z)
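A minimal sketch of the kNN-style re-ranking: each candidate's score interpolates its similarity to the query with its mean similarity to the documents the user marked relevant. The interpolation weight alpha is an assumption.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between rows of a (m, d) and rows of b (n, d) -> (m, n)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-9)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-9)
    return a @ b.T

def rerank_with_feedback(query_vec, cand_vecs, feedback_vecs, alpha=0.5):
    query_sim = cosine(query_vec[None, :], cand_vecs)[0]           # (n,)
    feedback_sim = cosine(cand_vecs, feedback_vecs).mean(axis=1)   # (n,)
    scores = alpha * query_sim + (1 - alpha) * feedback_sim
    return np.argsort(-scores)                                     # best first

rng = np.random.default_rng(0)
order = rerank_with_feedback(rng.normal(size=8),
                             rng.normal(size=(20, 8)),   # candidate documents
                             rng.normal(size=(3, 8)))    # user-relevant documents
print(order[:5])
```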
- Anchor Prediction: A Topic Modeling Approach [2.0411082897313984]
We propose an annotation task, which we refer to as anchor prediction.
Given a source document and a target document, this task consists in automatically identifying anchors in the source document.
We propose a contextualized relational topic model, CRTM, that models directed links between documents.
arXiv Detail & Related papers (2022-05-29T11:26:52Z)
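A toy sketch of anchor prediction: candidate tokens in the source document are scored by how strongly their local context overlaps the target document. CRTM is a contextualized relational topic model; this bag-of-words proxy is a simplifying assumption.

```python
from collections import Counter

def score_anchor_candidates(source_tokens, target_tokens, window=3):
    target_profile = Counter(target_tokens)
    scores = {}
    for i, tok in enumerate(source_tokens):
        context = source_tokens[max(0, i - window): i + window + 1]
        # a candidate anchor is promising if its context overlaps the target
        scores[tok] = sum(target_profile[c] for c in context)
    return sorted(scores.items(), key=lambda kv: -kv[1])

src = "the treaty was signed in geneva after long negotiations".split()
tgt = "geneva hosts many negotiations and treaty signings".split()
print(score_anchor_candidates(src, tgt)[:3])
```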
- Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval [51.823187647843945]
In this paper, we encode the neighborhood information with a graph-induced Gaussian distribution, and propose to integrate the two types of information with a graph-driven generative model.
Under the approximation, we prove that the training objective can be decomposed into terms involving only singleton or pairwise documents, enabling the model to be trained as efficiently as uncorrelated ones.
arXiv Detail & Related papers (2021-05-27T11:29:03Z)
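A hedged sketch of the singleton-plus-pairwise decomposition: a per-document term plus a pairwise term over linked documents, which is what lets training scale like the uncorrelated case. Both terms here are toy assumptions, not the paper's variational objective.

```python
import torch

def decomposed_objective(latents, recon_losses, edges, weight=0.1):
    """latents: (n, d) latent codes; recon_losses: (n,) singleton terms;
    edges: list of (i, j) linked document pairs."""
    singleton = recon_losses.sum()
    src = torch.tensor([i for i, _ in edges])
    dst = torch.tensor([j for _, j in edges])
    # pairwise term pulls the latent codes of linked documents together
    pairwise = ((latents[src] - latents[dst]) ** 2).sum(dim=1).mean()
    return singleton + weight * pairwise

z = torch.randn(5, 4, requires_grad=True)
decomposed_objective(z, torch.rand(5), [(0, 1), (1, 2), (3, 4)]).backward()
```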
- Predicting Links on Wikipedia with Anchor Text Information [0.571097144710995]
We study the transductive and the inductive tasks of link prediction on several subsets of the English Wikipedia.
We propose an appropriate evaluation sampling methodology and compare several algorithms.
arXiv Detail & Related papers (2021-05-25T07:57:57Z)
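A minimal anchor-text baseline sketch for Wikipedia link prediction: candidate target pages are ranked by lexical overlap between the anchor text and page titles. The paper's algorithms and evaluation sampling are richer than this illustration.

```python
def rank_targets(anchor_text, candidate_titles):
    anchor_terms = set(anchor_text.lower().split())
    def overlap(title):
        # lexical overlap between anchor text and page title
        return len(anchor_terms & set(title.lower().split()))
    return sorted(candidate_titles, key=overlap, reverse=True)

candidates = ["Geneva Conventions", "Geneva", "Lake Geneva", "Convention center"]
print(rank_targets("the geneva conventions", candidates))
```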
This list is automatically generated from the titles and abstracts of the papers on this site.