Pre-training Tasks for Embedding-based Large-scale Retrieval
- URL: http://arxiv.org/abs/2002.03932v1
- Date: Mon, 10 Feb 2020 16:44:00 GMT
- Title: Pre-training Tasks for Embedding-based Large-scale Retrieval
- Authors: Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar
- Abstract summary: We consider the large-scale query-document retrieval problem.
Given a query (e.g., a question), return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
- Score: 68.01167604281578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the large-scale query-document retrieval problem: given a query
(e.g., a question), return the set of relevant documents (e.g., paragraphs
containing the answer) from a large document corpus. This problem is often
solved in two steps. The retrieval phase first reduces the solution space,
returning a subset of candidate documents. The scoring phase then re-ranks the
documents. Critically, the retrieval algorithm not only desires high recall but
also requires to be highly efficient, returning candidates in time sublinear to
the number of documents. Unlike the scoring phase witnessing significant
advances recently due to the BERT-style pre-training tasks on cross-attention
models, the retrieval phase remains less well studied. Most previous works rely
on classic Information Retrieval (IR) methods such as BM-25 (token matching +
TF-IDF weights). These models only accept sparse handcrafted features and
cannot be optimized for different downstream tasks of interest. In this paper, we
conduct a comprehensive study on the embedding-based retrieval models. We show
that the key ingredient of learning a strong embedding-based Transformer model
is the set of pre-training tasks. With adequately designed paragraph-level
pre-training tasks, the Transformer models can remarkably improve over the
widely-used BM-25 as well as embedding models without Transformers. The
paragraph-level pre-training tasks we studied are Inverse Cloze Task (ICT),
Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination of
all three.
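The embedding-based retrieval model studied here is a two-tower (dual-encoder) Transformer: queries and documents are encoded independently and scored by inner product, so candidates can be fetched with (approximate) maximum inner product search in time sublinear in the corpus size. The sketch below is a minimal illustration of that setup together with ICT-style pre-training pair construction; the names (TwoTowerEncoder, in_batch_softmax_loss, make_ict_pair) and hyperparameters are assumptions for illustration, not the authors' implementation.
```python
# Minimal two-tower retrieval sketch (illustrative, not the paper's code).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerEncoder(nn.Module):  # hypothetical module name
    """Shared Transformer encoder used for both the query and document towers."""
    def __init__(self, vocab_size=30522, dim=128, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, token_ids):                    # (batch, seq_len)
        hidden = self.encoder(self.embed(token_ids))
        return hidden[:, 0]                          # first-token pooling -> (batch, dim)

def in_batch_softmax_loss(query_emb, doc_emb):
    """Inner-product scoring; the other documents in the batch act as negatives."""
    logits = query_emb @ doc_emb.T                   # (batch, batch)
    labels = torch.arange(query_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

def make_ict_pair(passage_token_ids):
    """Inverse Cloze Task (ICT): a random span of the passage plays the role of
    the query and the remainder plays the role of the document. BFS and WLP
    instead pair, roughly, a sentence from a Wikipedia article's first section
    with a passage from the same article (BFS) or a hyperlinked article (WLP)."""
    n = len(passage_token_ids)
    i, j = sorted(random.sample(range(1, n), 2))     # assumes n >= 3
    return passage_token_ids[i:j], passage_token_ids[:i] + passage_token_ids[j:]
```
In such a setup, pre-training on ICT/BFS/WLP pairs would precede fine-tuning on labelled query-document data; at serving time, document embeddings are pre-computed once and queries are answered with approximate nearest-neighbour search.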
Related papers
- Quam: Adaptive Retrieval through Query Affinity Modelling [15.3583908068962]
Building relevance models to rank documents based on user information needs is a central task in the information retrieval and NLP communities.
We offer a unifying view of the nascent area of adaptive retrieval by proposing Quam.
Our proposed approach, Quam, improves recall by up to 26% over standard re-ranking baselines.
arXiv Detail & Related papers (2024-10-26T22:52:12Z)
- List-aware Reranking-Truncation Joint Model for Search and Retrieval-augmented Generation [80.12531449946655]
We propose a Reranking-Truncation joint model (GenRT) that can perform the two tasks concurrently.
GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture.
Our method achieves SOTA performance on both reranking and truncation tasks for web search and retrieval-augmented LLMs.
arXiv Detail & Related papers (2024-02-05T06:52:53Z)
- Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question-answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z)
- DSI++: Updating Transformer Memory with New Documents [95.70264288158766]
We introduce DSI++, a continual learning challenge for DSI to incrementally index new documents.
We show that continual indexing of new documents leads to considerable forgetting of previously indexed documents.
We introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task.
arXiv Detail & Related papers (2022-12-19T18:59:34Z)
- Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer [80.50327229467993]
We show that a single model trained end-to-end can achieve both competitive retrieval and QA performance.
We show that end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings.
arXiv Detail & Related papers (2022-12-05T04:51:21Z)
- Questions Are All You Need to Train a Dense Passage Retriever [123.13872383489172]
ART is a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data.
It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question (see the sketch after this list).
arXiv Detail & Related papers (2022-06-21T18:16:31Z)
- Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval [11.465218502487959]
We design a method that mimics the potential queries for each document via an iterative clustering process.
We also optimize the matching function with a two-step score calculation procedure.
Experimental results on several popular ranking and QA datasets show that our model can achieve state-of-the-art results.
arXiv Detail & Related papers (2021-05-08T05:28:24Z)
- Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review [14.689883695115519]
Technology-assisted review (TAR) refers to iterative active learning for document review in high recall retrieval tasks.
Transformer-based models with supervised tuning have been found to improve effectiveness on many text classification tasks.
We show that just-right language model fine-tuning on the task collection before starting active learning is critical.
arXiv Detail & Related papers (2021-05-03T17:41:18Z)
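For the "Questions Are All You Need" (ART) entry above, the following is a hedged sketch of the described two-step autoencoding idea: dense retrieval scores come from question-passage inner products, and the soft training signal comes from how well each retrieved passage lets a frozen language model reconstruct the question. The function names, shapes, and KL-based loss form are assumptions for illustration, not the paper's implementation.
```python
# Hedged sketch of a question-reconstruction training signal for a retriever.
import torch
import torch.nn.functional as F

def retriever_log_dist(question_emb, passage_embs, temperature=1.0):
    """Step 1: dense retrieval -- a distribution over retrieved passages from
    question-passage inner products. Shapes: (dim,) and (num_passages, dim)."""
    scores = passage_embs @ question_emb
    return F.log_softmax(scores / temperature, dim=-1)

def art_style_loss(question_emb, passage_embs, reconstruction_logprobs, temperature=1.0):
    """Step 2: reconstruction_logprobs[k] is log p(question | passage_k) under a
    frozen language model. The retriever is trained so that its distribution over
    passages matches the soft labels implied by these reconstruction likelihoods."""
    target = F.softmax(reconstruction_logprobs / temperature, dim=-1)
    log_pred = retriever_log_dist(question_emb, passage_embs, temperature)
    return F.kl_div(log_pred, target, reduction="sum")
```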