CorpusBrain++: A Continual Generative Pre-Training Framework for
Knowledge-Intensive Language Tasks
- URL: http://arxiv.org/abs/2402.16767v1
- Date: Mon, 26 Feb 2024 17:35:44 GMT
- Title: CorpusBrain++: A Continual Generative Pre-Training Framework for
Knowledge-Intensive Language Tasks
- Authors: Jiafeng Guo, Changjiang Zhou, Ruqing Zhang, Jiangui Chen, Maarten de
Rijke, Yixing Fan and Xueqi Cheng
- Abstract summary: Knowledge-intensive language tasks (KILTs) typically require retrieving relevant documents from trustworthy corpora, e.g., Wikipedia, to produce specific answers.
Very recently, a pre-trained generative retrieval model for KILTs, named CorpusBrain, was proposed and reached new state-of-the-art retrieval performance.
- Score: 111.13988772503511
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-intensive language tasks (KILTs) typically require retrieving
relevant documents from trustworthy corpora, e.g., Wikipedia, to produce
specific answers. Very recently, a pre-trained generative retrieval model for
KILTs, named CorpusBrain, was proposed and reached new state-of-the-art
retrieval performance. However, most existing research on KILTs, including
CorpusBrain, has predominantly focused on a static document collection,
overlooking the dynamic nature of real-world scenarios, where new documents are
continuously being incorporated into the source corpus. To address this gap, it
is crucial to explore the capability of retrieval models to effectively handle
the dynamic retrieval scenario inherent in KILTs.
In this work, we first introduce the continual document learning (CDL) task
for KILTs and build a novel benchmark dataset named KILT++ based on the
original KILT dataset for evaluation. Then, we conduct a comprehensive study
over the use of pre-trained CorpusBrain on KILT++. Unlike the promising results
in the stationary scenario, CorpusBrain is prone to catastrophic forgetting in
the dynamic scenario, hence hampering the retrieval performance. To alleviate
this issue, we propose CorpusBrain++, a continual generative pre-training
framework. Empirical results demonstrate the significant effectiveness and
remarkable efficiency of CorpusBrain++ in comparison to both traditional and
generative IR methods.
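The catastrophic-forgetting effect described above can be made concrete with a toy metric. The sketch below is an illustrative assumption, not the paper's KILT++ evaluation protocol: it computes average forgetting from a matrix of per-session retrieval accuracies, where forgetting for a session is the drop from its best past accuracy to its final accuracy.

```python
# Hypothetical sketch: quantifying catastrophic forgetting in a
# continual document learning (CDL) setting. acc[t][k] is retrieval
# accuracy on the queries of session k after training on session t.

def average_forgetting(acc):
    T = len(acc)
    drops = []
    for k in range(T - 1):
        # Best accuracy on session k's queries before the final session.
        best_past = max(acc[t][k] for t in range(k, T - 1))
        # Drop from that peak to accuracy after the last session.
        drops.append(best_past - acc[T - 1][k])
    return sum(drops) / len(drops)

# Toy accuracy matrix for three sessions: performance on session-0
# queries degrades as new documents are incorporated.
acc = [
    [0.80, 0.00, 0.00],
    [0.65, 0.78, 0.00],
    [0.50, 0.70, 0.75],
]
print(average_forgetting(acc))
```

A continual-learning method such as CorpusBrain++ would aim to keep this number close to zero while still retrieving well on newly added documents.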
Related papers
- CorpusLM: Towards a Unified Language Model on Corpus for Knowledge-Intensive Tasks [20.390672895839757]
Retrieval-augmented generation (RAG) has emerged as a popular solution to enhance factual accuracy.
Traditional retrieval modules often rely on a large document index and are disconnected from generative tasks.
We propose CorpusLM, a unified language model that integrates generative retrieval, closed-book generation, and RAG.
arXiv Detail & Related papers (2024-02-02T06:44:22Z)
- Wikiformer: Pre-training with Structured Information of Wikipedia for Ad-hoc Retrieval [21.262531222066208]
In this paper, we devise four pre-training objectives tailored for information retrieval tasks based on the structured knowledge of Wikipedia.
Compared to existing pre-training methods, our approach can better capture the semantic knowledge in the training corpus.
Experimental results in biomedical and legal domains demonstrate that our approach achieves better performance in vertical domains.
arXiv Detail & Related papers (2023-12-17T09:31:47Z)
- WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus [61.209202634703104]
We introduce a new NLP task -- generating short factual articles with references for queries by mining supporting evidence from the Web.
The ultimate goal is to generate a fluent, informative, and factually-correct short article for a factual query unseen in Wikipedia.
We construct a large-scale dataset WebBrain-Raw by extracting English Wikipedia articles and their crawlable Wikipedia references.
arXiv Detail & Related papers (2023-04-10T02:55:48Z)
- Rediscovery of CNN's Versatility for Text-based Encoding of Raw Electronic Health Records [22.203204279166496]
We search for a versatile encoder that not only reduces the large data to a manageable size but also preserves the core information of patients to perform diverse clinical tasks.
We found that a hierarchically structured Convolutional Neural Network (CNN) often outperforms the state-of-the-art model on diverse tasks.
arXiv Detail & Related papers (2023-03-15T00:37:18Z)
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks [62.22920673080208]
A single-step generative model can dramatically simplify the search process and be optimized in an end-to-end manner.
We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index.
arXiv Detail & Related papers (2022-08-16T10:22:49Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers.
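The n-gram-identifier idea can be sketched in a few lines. The toy index below is a hypothetical illustration, not the paper's FM-index-based implementation: it maps every word n-gram of a passage to the passages containing it, so a generated n-gram can be resolved back to candidate documents.

```python
# Minimal sketch: treat every word n-gram (up to max_n words) of a
# passage as a possible identifier for that passage.
from collections import defaultdict

def ngrams(tokens, n):
    """All contiguous n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_ngram_index(passages, max_n=3):
    """Map each n-gram to the set of passage ids that contain it."""
    index = defaultdict(set)
    for pid, text in passages.items():
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for g in ngrams(tokens, n):
                index[g].add(pid)
    return index

passages = {
    "p1": "generative retrieval encodes the corpus in model parameters",
    "p2": "dense retrieval uses a separate vector index",
}
index = build_ngram_index(passages)
print(sorted(index[("retrieval",)]))                # -> ['p1', 'p2']
print(sorted(index[("generative", "retrieval")]))   # -> ['p1']
```

Longer n-grams are more discriminative, which is why an autoregressive model that decodes a distinctive substring can pinpoint a single passage without any imposed hierarchy.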
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering [87.32442219333046]
We propose a simple and resource-efficient method to pretrain the paragraph encoder.
Our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
arXiv Detail & Related papers (2020-04-30T18:09:50Z)
- Automatic Gesture Recognition in Robot-assisted Surgery with Reinforcement Learning and Tree Search [63.07088785532908]
We propose a framework based on reinforcement learning and tree search for joint surgical gesture segmentation and classification.
Our framework consistently outperforms the existing methods on the suturing task of JIGSAWS dataset in terms of accuracy, edit score and F1 score.
arXiv Detail & Related papers (2020-02-20T13:12:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.