Unsupervised Dense Retrieval Training with Web Anchors
- URL: http://arxiv.org/abs/2305.05834v1
- Date: Wed, 10 May 2023 01:46:17 GMT
- Title: Unsupervised Dense Retrieval Training with Web Anchors
- Authors: Yiqing Xie, Xiao Liu, Chenyan Xiong
- Abstract summary: We train an unsupervised dense retriever, Anchor-DR, with a contrastive learning task that matches the anchor text and the linked document.
Experiments show that Anchor-DR outperforms state-of-the-art methods on unsupervised dense retrieval by a large margin.
Our analysis further reveals that the pattern of anchor-document pairs is similar to that of search query-document pairs.
- Score: 29.44275536993025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present an unsupervised retrieval method with contrastive
learning on web anchors. The anchor text describes the content that is
referenced from the linked page. This shows similarities to search queries that
aim to retrieve pertinent information from relevant documents. Based on their
commonalities, we train an unsupervised dense retriever, Anchor-DR, with a
contrastive learning task that matches the anchor text and the linked document.
To filter out uninformative anchors (such as "homepage" or other functional
anchors), we present a novel filtering technique to only select anchors that
contain similar types of information as search queries. Experiments show that
Anchor-DR outperforms state-of-the-art methods on unsupervised dense retrieval
by a large margin (e.g., by 5.3% NDCG@10 on MSMARCO). The gain of our method is
especially significant for search and question answering tasks. Our analysis
further reveals that the pattern of anchor-document pairs is similar to that of
search query-document pairs. Code available at
https://github.com/Veronicium/AnchorDR.
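The contrastive task described in the abstract can be illustrated with a minimal sketch. It assumes in-batch negatives and dot-product scoring, which are common choices for dense retrievers but are not confirmed by the text above. The function names, the temperature parameter, and the toy vectors are hypothetical and are not taken from the Anchor-DR code.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(anchor_vecs, doc_vecs, temperature=1.0):
    """Mean InfoNCE-style loss over a batch (illustrative assumption).

    doc_vecs[i] is treated as the positive (linked) document for
    anchor_vecs[i]; every other document in the batch is a negative.
    """
    if not anchor_vecs or len(anchor_vecs) != len(doc_vecs):
        raise ValueError("need a non-empty batch of paired vectors")
    losses = []
    for i, anchor in enumerate(anchor_vecs):
        scores = [dot(anchor, doc) / temperature for doc in doc_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of the positive document.
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


# Toy embeddings standing in for encoder outputs (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
linked_docs = [[1.0, 0.0], [0.0, 1.0]]
mismatched_docs = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = in_batch_contrastive_loss(anchors, linked_docs)
mismatched_loss = in_batch_contrastive_loss(anchors, mismatched_docs)
```

In this toy setup the loss is lower when each anchor embedding is closest to its own linked document, which is the behavior such an objective rewards during training.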
Related papers
- Query-oriented Data Augmentation for Session Search [71.84678750612754]
We propose query-oriented data augmentation to enrich search logs and empower the modeling.
We generate supplemental training pairs by altering the most important part of a search context.
We develop several strategies to alter the current query, resulting in new training data with varying degrees of difficulty.
arXiv Detail & Related papers (2024-07-04T08:08:33Z)
- Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format.
We first craft the Wiki-SS dataset, a corpus of 1.3M Wikipedia web page screenshots, to answer the questions from the Natural Questions dataset.
In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
- Multiview Identifiers Enhanced Generative Retrieval [78.38443356800848]
Generative retrieval generates identifier strings of passages as the retrieval target.
We propose a new type of identifier, synthetic identifiers, that are generated based on the content of a passage.
Our proposed approach performs the best in generative retrieval, demonstrating its effectiveness and robustness.
arXiv Detail & Related papers (2023-05-26T06:50:21Z)
- Referral Augmentation for Zero-Shot Information Retrieval [30.811093210831018]
Referral-Augmented Retrieval (RAR) is a simple technique that links document indices with referrals.
RAR works with both sparse and dense retrievers, and outperforms generative text expansion techniques.
We analyze different methods for multi-referral aggregation and show that RAR enables up-to-date information retrieval without re-training.
arXiv Detail & Related papers (2023-05-24T12:28:35Z)
- Decomposing Complex Queries for Tip-of-the-tongue Retrieval [72.07449449115167]
Complex queries describe content elements (e.g., book characters or events), information beyond the document text.
This retrieval setting, called tip of the tongue (TOT), is especially challenging for models reliant on lexical and semantic overlap between query and document text.
We introduce a simple yet effective framework for handling such complex queries by decomposing the query into individual clues, routing those as sub-queries to specialized retrievers, and ensembling the results.
arXiv Detail & Related papers (2023-05-24T11:43:40Z)
- Anchor Prediction: Automatic Refinement of Internet Links [25.26235117917374]
We introduce the task of anchor prediction.
The goal is to identify the specific part of the linked target webpage that is most related to the source linking context.
We release the AuthorAnchors dataset, a collection of 34K naturally-occurring anchored links.
arXiv Detail & Related papers (2023-05-23T17:58:21Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- Precise Zero-Shot Dense Retrieval without Relevance Labels [60.457378374671656]
Hypothetical Document Embeddings (HyDE) is a zero-shot dense retrieval system.
We show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever.
arXiv Detail & Related papers (2022-12-20T18:09:52Z)
- Anchor Prediction: A Topic Modeling Approach [2.0411082897313984]
We propose an annotation, which we refer to as anchor prediction.
Given a source document and a target document, this task consists in automatically identifying anchors in the source document.
We propose a contextualized relational topic model, CRTM, that models directed links between documents.
arXiv Detail & Related papers (2022-05-29T11:26:52Z)
- Predicting Links on Wikipedia with Anchor Text Information [0.571097144710995]
We study the transductive and the inductive tasks of link prediction on several subsets of the English Wikipedia.
We propose an appropriate evaluation sampling methodology and compare several algorithms.
arXiv Detail & Related papers (2021-05-25T07:57:57Z)