CorpusBrain++: A Continual Generative Pre-Training Framework for
Knowledge-Intensive Language Tasks
- URL: http://arxiv.org/abs/2402.16767v1
- Date: Mon, 26 Feb 2024 17:35:44 GMT
- Title: CorpusBrain++: A Continual Generative Pre-Training Framework for
Knowledge-Intensive Language Tasks
- Authors: Jiafeng Guo, Changjiang Zhou, Ruqing Zhang, Jiangui Chen, Maarten de
Rijke, Yixing Fan and Xueqi Cheng
- Abstract summary: Knowledge-intensive language tasks (KILTs) typically require retrieving relevant documents from trustworthy corpora, e.g., Wikipedia, to produce specific answers.
Very recently, a pre-trained generative retrieval model for KILTs, named CorpusBrain, was proposed and reached new state-of-the-art retrieval performance.
- Score: 111.13988772503511
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-intensive language tasks (KILTs) typically require retrieving
relevant documents from trustworthy corpora, e.g., Wikipedia, to produce
specific answers. Very recently, a pre-trained generative retrieval model for
KILTs, named CorpusBrain, was proposed and reached new state-of-the-art
retrieval performance. However, most existing research on KILTs, including
CorpusBrain, has predominantly focused on a static document collection,
overlooking the dynamic nature of real-world scenarios, where new documents are
continuously being incorporated into the source corpus. To address this gap, it
is crucial to explore the capability of retrieval models to effectively handle
the dynamic retrieval scenario inherent in KILTs.
In this work, we first introduce the continual document learning (CDL) task
for KILTs and build a novel benchmark dataset named KILT++ based on the
original KILT dataset for evaluation. Then, we conduct a comprehensive study
over the use of pre-trained CorpusBrain on KILT++. Unlike the promising results
in the stationary scenario, CorpusBrain is prone to catastrophic forgetting in
the dynamic scenario, hence hampering the retrieval performance. To alleviate
this issue, we propose CorpusBrain++, a continual generative pre-training
framework. Empirical results demonstrate the significant effectiveness and
remarkable efficiency of CorpusBrain++ in comparison to both traditional and
generative IR methods.
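To make the dynamic-corpus setting concrete, the following is a minimal, hypothetical sketch of a generative retriever: a seq2seq model maps a query directly to a document identifier, and when new documents arrive it continues training only on pseudo-queries built from them rather than re-indexing the whole corpus. The model choice and the make_pseudo_queries / continual_update / retrieve helpers are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only (not the authors' code): generative retrieval over a
# growing corpus. Any seq2seq checkpoint would do for the demonstration.
import random

import torch
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")


def make_pseudo_queries(doc_id, text, n=3):
    """Sample sentences from a document as pseudo-queries paired with its identifier."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [(s, doc_id) for s in random.sample(sentences, k=min(n, len(sentences)))]


def continual_update(new_docs, steps=100, lr=1e-5):
    """Continue pre-training on pseudo-queries built only from newly added documents,
    instead of rebuilding an index or re-training on the full corpus."""
    pairs = [pair for doc_id, text in new_docs.items()
             for pair in make_pseudo_queries(doc_id, text)]
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        query, doc_id = random.choice(pairs)
        inputs = tokenizer(query, return_tensors="pt")
        labels = tokenizer(doc_id, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()


def retrieve(query, k=5):
    """Generate k candidate document identifiers directly; no external index is used."""
    model.eval()
    inputs = tokenizer(query, return_tensors="pt")
    out = model.generate(**inputs, num_beams=k, num_return_sequences=k, max_length=32)
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```

A real system would additionally constrain decoding to valid identifiers (e.g., with a prefix trie) and apply the anti-forgetting strategies that CorpusBrain++ studies; both are omitted from this sketch.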
Related papers
- Bootstrapped Pre-training with Dynamic Identifier Prediction for Generative Retrieval [108.9772640854136]
Generative retrieval uses differentiable search indexes to directly generate relevant document identifiers in response to a query.
Recent studies have highlighted the potential of a strong generative retrieval model, trained with carefully crafted pre-training tasks, to enhance downstream retrieval tasks via fine-tuning.
We introduce BootRet, a bootstrapped pre-training method for generative retrieval that dynamically adjusts document identifiers during pre-training to accommodate a continually evolving corpus.
arXiv Detail & Related papers (2024-07-16T08:42:36Z)
- CorpusLM: Towards a Unified Language Model on Corpus for Knowledge-Intensive Tasks [20.390672895839757]
Retrieval-augmented generation (RAG) has emerged as a popular solution to enhance factual accuracy.
Traditional retrieval modules often rely on a large document index and are disconnected from generative tasks.
We propose CorpusLM, a unified language model that integrates generative retrieval, closed-book generation, and RAG.
arXiv Detail & Related papers (2024-02-02T06:44:22Z)
- Wikiformer: Pre-training with Structured Information of Wikipedia for Ad-hoc Retrieval [21.262531222066208]
In this paper, we devise four pre-training objectives tailored for information retrieval tasks based on the structured knowledge of Wikipedia.
Compared to existing pre-training methods, our approach can better capture the semantic knowledge in the training corpus.
Experimental results in biomedical and legal domains demonstrate that our approach achieves better performance in vertical domains.
arXiv Detail & Related papers (2023-12-17T09:31:47Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus [61.209202634703104]
We introduce a new NLP task -- generating short factual articles with references for queries by mining supporting evidence from the Web.
The ultimate goal is to generate a fluent, informative, and factually-correct short article for a factual query unseen in Wikipedia.
We construct a large-scale dataset WebBrain-Raw by extracting English Wikipedia articles and their crawlable Wikipedia references.
arXiv Detail & Related papers (2023-04-10T02:55:48Z)
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks [62.22920673080208]
A single-step generative model can dramatically simplify the search process and be optimized in an end-to-end manner.
We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index.
arXiv Detail & Related papers (2022-08-16T10:22:49Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers (a toy sketch of this idea follows the list below).
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
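As a rough, self-contained illustration of the last entry's idea (any n-gram of a passage can act as one of its identifiers), the hypothetical sketch below maps n-grams back to passages and ranks passages by the n-grams associated with a query. The function names and the simple count-based scoring are assumptions for exposition; the original work instead constrains an autoregressive language model (via an FM-index) so that it can only generate n-grams that actually occur in the corpus, which this toy version does not reproduce.

```python
# Toy illustration (not the paper's implementation) of passage n-grams as
# retrieval identifiers: every n-gram points back to the passages containing it,
# and passages are scored by how many query-side n-grams hit them.
from collections import defaultdict


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined back into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_index(passages, n=3):
    """Map each n-gram to the set of passage ids that contain it."""
    index = defaultdict(set)
    for pid, text in passages.items():
        for gram in ngrams(text.lower().split(), n):
            index[gram].add(pid)
    return index


def score_passages(candidate_ngrams, index):
    """Rank passages by how many of the candidate n-grams point to them."""
    scores = defaultdict(int)
    for gram in candidate_ngrams:
        for pid in index.get(gram, ()):
            scores[pid] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    passages = {
        "doc1": "CorpusBrain encodes all information about the corpus in its parameters",
        "doc2": "dense retrieval indexes a corpus by passages or propositions",
    }
    index = build_ngram_index(passages, n=3)
    # In a real system these n-grams would be produced by a constrained LM.
    print(score_passages(["information about the", "about the corpus"], index))
```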