CorpusBrain: Pre-train a Generative Retrieval Model for
Knowledge-Intensive Language Tasks
- URL: http://arxiv.org/abs/2208.07652v1
- Date: Tue, 16 Aug 2022 10:22:49 GMT
- Title: CorpusBrain: Pre-train a Generative Retrieval Model for
Knowledge-Intensive Language Tasks
- Authors: Jiangui Chen, Ruqing Zhang, Jiafeng Guo, Yiqun Liu, Yixing Fan, Xueqi
Cheng
- Abstract summary: A single-step generative model can dramatically simplify the search process and be optimized in an end-to-end manner.
We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index.
- Score: 62.22920673080208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-intensive language tasks (KILT) usually require a large body of
information to provide correct answers. A popular paradigm to solve this
problem is to combine a search system with a machine reader, where the former
retrieves supporting evidence and the latter examines it to produce answers.
Recently, the reader component has witnessed significant advances with the help
of large-scale pre-trained generative models. Meanwhile, most existing solutions
in the search component rely on the traditional "index-retrieve-then-rank"
pipeline, which suffers from a large memory footprint and difficulty in
end-to-end optimization. Inspired by recent efforts in constructing model-based
IR models, we propose to replace the traditional multi-step search pipeline
with a novel single-step generative model, which can dramatically simplify the
search process and be optimized in an end-to-end manner. We show that a strong
generative retrieval model can be learned with a set of adequately designed
pre-training tasks, and be adopted to improve a variety of downstream KILT
tasks with further fine-tuning. We name the pre-trained generative retrieval
model CorpusBrain, as all information about the corpus is encoded in its
parameters without the need to construct an additional index. Empirical results
show that CorpusBrain can significantly outperform strong baselines for the
retrieval task on the KILT benchmark and establish new state-of-the-art
downstream performances. We also show that CorpusBrain works well under zero-
and low-resource settings.
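To make the single-step generative retrieval idea concrete, below is a minimal sketch in which a seq2seq model maps a query directly to a document identifier (here, a Wikipedia page title) via constrained beam search, so no separate document index is consulted at query time. The BART checkpoint, the toy title list, and the prefix-constraint helper are illustrative assumptions, not the released CorpusBrain model or its pre-training tasks.

```python
# Minimal sketch of single-step generative retrieval (illustrative only):
# a seq2seq model generates a document identifier, e.g. a Wikipedia page
# title, directly from the query under a constrained beam search.
# The checkpoint and the tiny title list are placeholders, not CorpusBrain.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Hypothetical candidate identifiers; a prefix trie over all page titles
# would be used for a real corpus.
candidate_titles = ["Albert Einstein", "Theory of relativity", "Photoelectric effect"]
candidate_ids = [tokenizer(t, add_special_tokens=False).input_ids for t in candidate_titles]
special = set(tokenizer.all_special_ids)

def allowed_tokens(batch_id, prefix_ids):
    """Allow only continuations that keep the decoded prefix inside a known title."""
    prefix = [t for t in prefix_ids.tolist() if t not in special]
    allowed = set()
    for ids in candidate_ids:
        if ids[:len(prefix)] == prefix:
            # Either continue the title or, once it is complete, emit EOS.
            allowed.add(ids[len(prefix)] if len(ids) > len(prefix) else tokenizer.eos_token_id)
    return sorted(allowed) or [tokenizer.eos_token_id]

query = "Who proposed the theory of relativity?"
inputs = tokenizer(query, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=20,
                         prefix_allowed_tokens_fn=allowed_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Without the pre-training and fine-tuning described in the paper, a base checkpoint will not rank these titles meaningfully; the sketch only shows how decoding can be restricted to valid identifiers so that retrieval reduces to generation.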
Related papers
- Bootstrapped Pre-training with Dynamic Identifier Prediction for Generative Retrieval [108.9772640854136]
Generative retrieval uses differentiable search indexes to directly generate relevant document identifiers in response to a query.
Recent studies have highlighted the potential of a strong generative retrieval model, trained with carefully crafted pre-training tasks, to enhance downstream retrieval tasks via fine-tuning.
We introduce BootRet, a bootstrapped pre-training method for generative retrieval that dynamically adjusts document identifiers during pre-training to accommodate the continuing memorization of the corpus.
arXiv Detail & Related papers (2024-07-16T08:42:36Z)
- Reinforcement Learning with Generative Models for Compact Support Sets [10.041289551532804]
We propose a framework utilizing reinforcement learning as a control for foundation models.
Our framework produced excellent results, increasing classification accuracy by significant margins at no additional labelling or data cost.
arXiv Detail & Related papers (2024-04-25T02:48:16Z)
- CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks [111.13988772503511]
Knowledge-intensive language tasks (KILTs) typically require retrieving relevant documents from trustworthy corpora, e.g., Wikipedia, to produce specific answers.
Very recently, a pre-trained generative retrieval model for KILTs, named CorpusBrain, was proposed and reached new state-of-the-art retrieval performance.
arXiv Detail & Related papers (2024-02-26T17:35:44Z)
- CorpusLM: Towards a Unified Language Model on Corpus for Knowledge-Intensive Tasks [20.390672895839757]
Retrieval-augmented generation (RAG) has emerged as a popular solution to enhance factual accuracy.
Traditional retrieval modules often rely on a large document index and are disconnected from generative tasks.
We propose CorpusLM, a unified language model that integrates generative retrieval, closed-book generation, and RAG.
arXiv Detail & Related papers (2024-02-02T06:44:22Z)
- Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy [164.83371924650294]
We show that strong performance can be achieved by a method we call Iter-RetGen, which synergizes retrieval and generation in an iterative manner.
A model output shows what might be needed to finish a task, and thus provides an informative context for retrieving more relevant knowledge.
Iter-RetGen processes all retrieved knowledge as a whole and largely preserves the flexibility in generation without structural constraints.
arXiv Detail & Related papers (2023-05-24T16:17:36Z)
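As a rough illustration of the iterative synergy described in the entry above, the sketch below feeds each round's generated answer back into the next retrieval query. The `retrieve` and `generate` callables are hypothetical stand-ins for a retriever and a reader model, not the paper's actual implementation.

```python
# Illustrative sketch of an iterative retrieve-then-generate loop
# (in the spirit of Iter-RetGen; `retrieve` and `generate` are stand-ins).
from typing import Callable, List

def iter_retgen(question: str,
                retrieve: Callable[[str], List[str]],        # query -> supporting passages
                generate: Callable[[str, List[str]], str],   # question + passages -> answer
                iterations: int = 2) -> str:
    query, answer = question, ""
    for _ in range(iterations):
        passages = retrieve(query)              # retrieval informed by the latest output
        answer = generate(question, passages)   # all retrieved knowledge used as a whole
        # The model's output hints at what is still missing, so append it to
        # the next retrieval query to surface more relevant knowledge.
        query = f"{question} {answer}"
    return answer
```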
- How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to millions of passages with a corpus of 8.8M passages and evaluate model sizes up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work, we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
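To illustrate the n-gram-as-identifier idea in the entry above, here is a minimal sketch that maps n-grams back to the passages containing them and scores passages by simple voting. The actual system constrains an autoregressive LM with an FM-index and uses language-model scores over the generated n-grams, neither of which is reproduced here; the query's own n-grams stand in for generated ones.

```python
# Illustrative sketch: treat every n-gram of a passage as a possible
# identifier and score passages by the n-grams produced for a query.
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def ngrams(text: str, n: int = 3) -> List[Tuple[str, ...]]:
    toks = text.lower().split()
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def build_ngram_index(passages: Dict[str, str], n: int = 3) -> Dict[Tuple[str, ...], Set[str]]:
    index: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for pid, text in passages.items():
        for g in ngrams(text, n):
            index[g].add(pid)
    return index

def score_passages(query_ngrams: List[Tuple[str, ...]],
                   index: Dict[Tuple[str, ...], Set[str]]) -> Dict[str, int]:
    scores: Dict[str, int] = defaultdict(int)
    for g in query_ngrams:
        for pid in index.get(g, ()):
            scores[pid] += 1        # each matching n-gram votes for its passage
    return dict(scores)

passages = {"p1": "the theory of relativity was proposed by Albert Einstein"}
index = build_ngram_index(passages)
print(score_passages(ngrams("theory of relativity proposed by Einstein"), index))
```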
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.