Pre-training Tasks for Embedding-based Large-scale Retrieval
- URL: http://arxiv.org/abs/2002.03932v1
- Date: Mon, 10 Feb 2020 16:44:00 GMT
- Title: Pre-training Tasks for Embedding-based Large-scale Retrieval
- Authors: Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar
- Abstract summary: We consider the large-scale query-document retrieval problem.
Given a query (e.g., a question), return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
- Score: 68.01167604281578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the large-scale query-document retrieval problem: given a query
(e.g., a question), return the set of relevant documents (e.g., paragraphs
containing the answer) from a large document corpus. This problem is often
solved in two steps. The retrieval phase first reduces the solution space,
returning a subset of candidate documents. The scoring phase then re-ranks the
documents. Critically, the retrieval algorithm not only desires high recall but
also requires to be highly efficient, returning candidates in time sublinear to
the number of documents. Unlike the scoring phase witnessing significant
advances recently due to the BERT-style pre-training tasks on cross-attention
models, the retrieval phase remains less well studied. Most previous works rely
on classic Information Retrieval (IR) methods such as BM-25 (token matching +
TF-IDF weights). These models only accept sparse handcrafted features and
cannot be optimized for different downstream tasks of interest. In this paper, we
conduct a comprehensive study on the embedding-based retrieval models. We show
that the key ingredient of learning a strong embedding-based Transformer model
is the set of pre-training tasks. With adequately designed paragraph-level
pre-training tasks, the Transformer models can remarkably improve over the
widely-used BM-25 as well as embedding models without Transformers. The
paragraph-level pre-training tasks we studied are Inverse Cloze Task (ICT),
Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination of
all three.
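The embedding-based retrieval model studied here is a two-tower (dual-encoder) Transformer: queries and documents are encoded independently and scored by inner product, so candidates can be fetched with (approximate) maximum inner product search in time sublinear in the corpus size. The sketch below is a minimal illustration of that setup together with ICT-style pre-training pair construction; the names (TwoTowerEncoder, in_batch_softmax_loss, make_ict_pair) and hyperparameters are assumptions for illustration, not the authors' implementation.
```python
# Minimal two-tower retrieval sketch (illustrative, not the paper's code).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerEncoder(nn.Module):  # hypothetical module name
    """Shared Transformer encoder used for both the query and document towers."""
    def __init__(self, vocab_size=30522, dim=128, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, token_ids):                    # (batch, seq_len)
        hidden = self.encoder(self.embed(token_ids))
        return hidden[:, 0]                          # first-token pooling -> (batch, dim)

def in_batch_softmax_loss(query_emb, doc_emb):
    """Inner-product scoring; the other documents in the batch act as negatives."""
    logits = query_emb @ doc_emb.T                   # (batch, batch)
    labels = torch.arange(query_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

def make_ict_pair(passage_token_ids):
    """Inverse Cloze Task (ICT): a random span of the passage plays the role of
    the query and the remainder plays the role of the document. BFS and WLP
    instead pair, roughly, a sentence from a Wikipedia article's first section
    with a passage from the same article (BFS) or a hyperlinked article (WLP)."""
    n = len(passage_token_ids)
    i, j = sorted(random.sample(range(1, n), 2))     # assumes n >= 3
    return passage_token_ids[i:j], passage_token_ids[:i] + passage_token_ids[j:]
```
In such a setup, pre-training on ICT/BFS/WLP pairs would precede fine-tuning on labelled query-document data; at serving time, document embeddings are pre-computed once and queries are answered with approximate nearest-neighbour search.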
Related papers
- Quam: Adaptive Retrieval through Query Affinity Modelling [15.3583908068962]
Building relevance models to rank documents based on user information needs is a central task in the information retrieval and NLP communities.
We offer a unifying view of the nascent area of adaptive retrieval by proposing Quam.
Our proposed approach, Quam, improves recall by up to 26% over standard re-ranking baselines.
arXiv Detail & Related papers (2024-10-26T22:52:12Z)
- List-aware Reranking-Truncation Joint Model for Search and Retrieval-augmented Generation [80.12531449946655]
We propose a Reranking-Truncation joint model (GenRT) that can perform the two tasks concurrently.
GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture.
Our method achieves SOTA performance on both reranking and truncation tasks for web search and retrieval-augmented LLMs.
arXiv Detail & Related papers (2024-02-05T06:52:53Z)
- Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question-answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z)
- DSI++: Updating Transformer Memory with New Documents [95.70264288158766]
We introduce DSI++, a continual learning challenge for DSI to incrementally index new documents.
We show that continual indexing of new documents leads to considerable forgetting of previously indexed documents.
We introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task.
arXiv Detail & Related papers (2022-12-19T18:59:34Z)
- Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer [80.50327229467993]
We show that a single model trained end-to-end can achieve both competitive retrieval and QA performance.
We show that end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings.
arXiv Detail & Related papers (2022-12-05T04:51:21Z)
- Questions Are All You Need to Train a Dense Passage Retriever [123.13872383489172]
ART is a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data.
It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question (see the sketch after this list).
arXiv Detail & Related papers (2022-06-21T18:16:31Z)
- Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval [11.465218502487959]
We design a method that mimics the potential queries for each document via an iterative clustering process.
We also optimize the matching function with a two-step score calculation procedure.
Experimental results on several popular ranking and QA datasets show that our model can achieve state-of-the-art results.
arXiv Detail & Related papers (2021-05-08T05:28:24Z)
- Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review [14.689883695115519]
Technology-assisted review (TAR) refers to iterative active learning for document review in high recall retrieval tasks.
Transformer-based models with supervised tuning have been found to improve effectiveness on many text classification tasks.
We show that just-right language model fine-tuning on the task collection before starting active learning is critical.
arXiv Detail & Related papers (2021-05-03T17:41:18Z)
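For the "Questions Are All You Need" (ART) entry above, the following is a hedged sketch of the described two-step autoencoding idea: dense retrieval scores come from question-passage inner products, and the soft training signal comes from how well each retrieved passage lets a frozen language model reconstruct the question. The function names, shapes, and KL-based loss form are assumptions for illustration, not the paper's implementation.
```python
# Hedged sketch of a question-reconstruction training signal for a retriever.
import torch
import torch.nn.functional as F

def retriever_log_dist(question_emb, passage_embs, temperature=1.0):
    """Step 1: dense retrieval -- a distribution over retrieved passages from
    question-passage inner products. Shapes: (dim,) and (num_passages, dim)."""
    scores = passage_embs @ question_emb
    return F.log_softmax(scores / temperature, dim=-1)

def art_style_loss(question_emb, passage_embs, reconstruction_logprobs, temperature=1.0):
    """Step 2: reconstruction_logprobs[k] is log p(question | passage_k) under a
    frozen language model. The retriever is trained so that its distribution over
    passages matches the soft labels implied by these reconstruction likelihoods."""
    target = F.softmax(reconstruction_logprobs / temperature, dim=-1)
    log_pred = retriever_log_dist(question_emb, passage_embs, temperature)
    return F.kl_div(log_pred, target, reduction="sum")
```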