Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval
- URL: http://arxiv.org/abs/2308.08285v1
- Date: Wed, 16 Aug 2023 11:10:43 GMT
- Title: Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval
- Authors: Guangyuan Ma, Xing Wu, Peng Wang, Zijia Lin, Songlin Hu
- Abstract summary: We study the potential of pre-training with Large Language Model (LLM)-based document expansion for dense passage retrieval.
Our work shows strong zero-shot and out-of-domain retrieval abilities, making it widely applicable to retrieval settings with no human-labeled data for initialization.
- Score: 28.906829093158592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we systematically study the potential of pre-training with
Large Language Model (LLM)-based document expansion for dense passage retrieval.
Concretely, we leverage the capabilities of LLMs for document expansion, i.e.,
query generation, and effectively transfer the expanded knowledge to retrievers
using pre-training strategies tailored for passage retrieval. These strategies
include contrastive learning and bottlenecked query generation. Furthermore, we
incorporate a curriculum learning strategy to reduce the reliance on LLM
inferences. Experimental results demonstrate that pre-training with LLM-based
document expansion significantly boosts retrieval performance on large-scale
web-search tasks. Our work shows strong zero-shot and out-of-domain retrieval
abilities, making it widely applicable to retrieval settings where no
human-labeled data is available for initialization.
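To make the pre-training recipe above concrete, here is a minimal sketch of LLM-based document expansion (query generation) followed by contrastive learning with in-batch negatives. The model names, prompt, and hyperparameters are stand-in assumptions for illustration, not the authors' released implementation; the bottlenecked query-generation objective and the curriculum schedule are omitted for brevity.

```python
# Minimal sketch, assuming a small causal LLM as the query generator and a
# BERT-style dual encoder as the retriever; not the authors' implementation.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

llm_tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in query generator
llm = AutoModelForCausalLM.from_pretrained("gpt2")
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")  # retriever to pre-train

def expand_document(passage: str, max_new_tokens: int = 32) -> str:
    """Document expansion: prompt the LLM for a query the passage answers."""
    prompt = f"Passage: {passage}\nA search query this passage answers:\n"
    ids = llm_tok(prompt, return_tensors="pt").input_ids
    out = llm.generate(ids, max_new_tokens=max_new_tokens, do_sample=True,
                       pad_token_id=llm_tok.eos_token_id)
    return llm_tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def embed(texts):
    """[CLS] embeddings from the dual encoder."""
    batch = enc_tok(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]

def contrastive_loss(queries, passages, tau: float = 0.05):
    """InfoNCE with in-batch negatives: each generated query should score
    its source passage above every other passage in the batch."""
    q = F.normalize(embed(queries), dim=-1)
    p = F.normalize(embed(passages), dim=-1)
    logits = q @ p.T / tau
    return F.cross_entropy(logits, torch.arange(len(queries)))

passages = ["The Eiffel Tower is in Paris.", "BM25 ranks documents by term overlap."]
queries = [expand_document(p) for p in passages]  # LLM inference, done offline
loss = contrastive_loss(queries, passages)
loss.backward()  # updates only the retriever; no human-labeled data involved
```

In the same spirit, bottlenecked query generation (the paper's second strategy) can be read as training a shallow decoder to reproduce the generated query from a single passage embedding, forcing that embedding to absorb the expanded knowledge.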
Related papers
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We investigate the interplay between generalization and memorization in large language models at scale.
With various sizes of open-source LLMs and their pretraining corpora, we observe that as the model size increases, the task-relevant $n$-gram pair data becomes increasingly important.
Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization with sufficient task-related pretraining data.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation [11.890598082534577]
Retrieval augmented generation (RAG) has been applied in many scenarios to augment large language models (LLMs) with external documents provided by retrievers.
This paper proposes R$^2$AG, a novel enhanced RAG framework that incorporates Retrieval information into Retrieval Augmented Generation.
arXiv Detail & Related papers (2024-06-19T06:19:48Z)
- Enhancing Biomedical Knowledge Retrieval-Augmented Generation with Self-Rewarding Tree Search and Proximal Policy Optimization [50.26966969163348]
Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented generation (RAG).
Existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for medical knowledge queries.
We propose Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm.
arXiv Detail & Related papers (2024-06-17T06:48:31Z)
- Leveraging Large Language Models for Web Scraping [0.0]
This research investigates a general-purpose, accurate data-scraping recipe for RAG models designed for language generation.
To capture knowledge in a more modular and interpretable way, we use pre-trained language models with a latent knowledge retriever.
arXiv Detail & Related papers (2024-06-12T14:15:15Z)
- R4: Reinforced Retriever-Reorder-Responder for Retrieval-Augmented Large Language Models [32.598670876662375]
Retrieval-augmented large language models (LLMs) leverage relevant content retrieved by information retrieval systems to generate correct responses.
Existing retriever-responder methods typically append relevant documents to the prompt of LLMs to perform text generation tasks.
We propose a new pipeline named "Reinforced Retriever-Reorder-Responder" to learn document orderings for retrieval-augmented LLMs.
arXiv Detail & Related papers (2024-05-04T12:59:10Z)
- Corpus-Steered Query Expansion with Large Language Models [35.64662397095323]
We introduce Corpus-Steered Query Expansion (CSQE) to promote the incorporation of knowledge embedded within the corpus.
CSQE utilizes the relevance assessing capability of LLMs to systematically identify pivotal sentences in the initially-retrieved documents.
Extensive experiments reveal that CSQE exhibits strong performance without requiring any training, especially on queries for which LLMs lack knowledge; a minimal sketch of the expansion loop follows this entry.
arXiv Detail & Related papers (2024-02-28T03:58:58Z)
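The corpus-steered loop described above lends itself to a short sketch: retrieve once, let the LLM select pivotal sentences from the top documents, and re-search with the expanded query. The search() and llm_pick_pivotal() helpers below are hypothetical placeholders, not CSQE's released code.

```python
# Hedged sketch of corpus-steered query expansion (CSQE-style), assuming a
# retriever `search(query, k) -> list[str]` and an LLM relevance assessor
# `llm_pick_pivotal(query, docs) -> list[str]`; both are hypothetical helpers.
def corpus_steered_expansion(query: str, search, llm_pick_pivotal, k: int = 10):
    first_pass = search(query, k=k)                # initial retrieval
    pivotal = llm_pick_pivotal(query, first_pass)  # LLM selects key sentences
    expanded = query + " " + " ".join(pivotal)     # corpus-grounded expansion
    return search(expanded, k=k)                   # second-pass retrieval
```

No training is involved, which matches the summary's claim that CSQE works out of the box.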
- Query Rewriting for Retrieval-Augmented Large Language Models [139.242907155883]
Large Language Models (LLMs) serve as powerful, black-box readers in the retrieve-then-read pipeline.
This work introduces a new framework, Rewrite-Retrieve-Read, which replaces the previous retrieve-then-read pipeline for retrieval-augmented LLMs; a minimal pipeline sketch follows this entry.
arXiv Detail & Related papers (2023-05-23T17:27:50Z)
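The Rewrite-Retrieve-Read entry above describes a three-stage pipeline; a minimal sketch follows, with llm_complete() and web_search() as hypothetical placeholders for any black-box LLM endpoint and any search engine API.

```python
# Hedged sketch of a rewrite-retrieve-read pipeline; the helper callables are
# hypothetical stand-ins, not an API from the cited paper.
def rewrite_retrieve_read(question: str, llm_complete, web_search, k: int = 5) -> str:
    # 1) Rewrite: ask the LLM for a search-friendly query.
    query = llm_complete(f"Rewrite as a concise web search query:\n{question}")
    # 2) Retrieve: fetch top-k documents for the rewritten query.
    docs = web_search(query, k=k)
    # 3) Read: answer the original question from the retrieved context.
    context = "\n\n".join(docs)
    return llm_complete(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```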
- The Web Can Be Your Oyster for Improving Large Language Models [98.72358969495835]
Large language models (LLMs) encode a large amount of world knowledge.
We consider augmenting LLMs with the large-scale web using a search engine.
We present a web-augmented LLM, UNIWEB, which is trained on 16 knowledge-intensive tasks in a unified text-to-text format.
arXiv Detail & Related papers (2023-05-18T14:20:32Z)
- Synergistic Interplay between Search and Large Language Models for Information Retrieval [141.18083677333848]
InteR allows retrieval models (RMs) to expand knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z)
- Query2doc: Query Expansion with Large Language Models [69.9707552694766]
The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs).
query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets.
Our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results; a minimal expansion sketch follows this entry.
arXiv Detail & Related papers (2023-03-14T07:27:30Z)
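As a rough illustration of query2doc-style expansion for sparse retrieval, the sketch below concatenates the original query (repeated so its terms keep weight) with an LLM-generated pseudo-document before BM25 scoring. generate_pseudo_doc() and the repeat factor are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of query2doc-style BM25 expansion using the rank_bm25 package.
from rank_bm25 import BM25Okapi

corpus = [
    "The Eiffel Tower was completed in 1889 in Paris.",
    "BM25 is a bag-of-words ranking function.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def expand_query(query: str, generate_pseudo_doc, n_repeat: int = 5) -> str:
    """Repeat the query to preserve its term weight, then append the LLM's
    pseudo-document; generate_pseudo_doc is a hypothetical few-shot LLM call."""
    return " ".join([query] * n_repeat + [generate_pseudo_doc(query)])

query = "when was the eiffel tower built"
expanded = expand_query(query, lambda q: "The Eiffel Tower was built in 1889.")
scores = bm25.get_scores(expanded.lower().split())  # rank with the expanded query
```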
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.