Pre-training with Large Language Model-based Document Expansion for
Dense Passage Retrieval
- URL: http://arxiv.org/abs/2308.08285v1
- Date: Wed, 16 Aug 2023 11:10:43 GMT
- Title: Pre-training with Large Language Model-based Document Expansion for
Dense Passage Retrieval
- Authors: Guangyuan Ma, Xing Wu, Peng Wang, Zijia Lin, Songlin Hu
- Abstract summary: We study the potential of pre-training with Large Language Model(LLM)-based document expansion for dense passage retrieval.
Our work shows strong zero-shot and out-of-domain retrieval abilities, making it more widely applicable for retrieval when initializing with no human-labeled data.
- Score: 28.906829093158592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we systematically study the potential of pre-training with
Large Language Model(LLM)-based document expansion for dense passage retrieval.
Concretely, we leverage the capabilities of LLMs for document expansion, i.e.
query generation, and effectively transfer expanded knowledge to retrievers
using pre-training strategies tailored for passage retrieval. These strategies
include contrastive learning and bottlenecked query generation. Furthermore, we
incorporate a curriculum learning strategy to reduce the reliance on LLM
inferences. Experimental results demonstrate that pre-training with LLM-based
document expansion significantly boosts the retrieval performance on
large-scale web-search tasks. Our work shows strong zero-shot and out-of-domain
retrieval abilities, making it more widely applicable for retrieval when
initializing with no human-labeled data.
Related papers
- Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG [36.754491649652664]
Retrieval-augmented generation (RAG) empowers large language models (LLMs) to utilize external knowledge sources.
This paper investigates the detrimental impact of retrieved "hard negatives" as a key contributor.
To mitigate this and enhance the robustness of long-context LLM-based RAG, we propose both training-free and training-based approaches.
arXiv Detail & Related papers (2024-10-08T12:30:07Z) - Large Language Models as Foundations for Next-Gen Dense Retrieval: A Comprehensive Empirical Assessment [16.39696580487218]
Pretrained language models like BERT and T5 serve as crucial backbone encoders for dense retrieval.
Recent research has explored using large language models (LLMs) as retrievers, achieving SOTA performance across various tasks.
arXiv Detail & Related papers (2024-08-22T08:16:07Z) - On the Role of Long-tail Knowledge in Retrieval Augmented Large Language Models [33.08049246893537]
Retrieval augmented generation (RAG) exhibits outstanding performance in promoting the knowledge capabilities of large language models (LLMs)
We propose a simple but effective long-tail knowledge detection method for LLMs.
Our method achieves over 4x speedup in average inference time and consistent performance improvement in downstream tasks.
arXiv Detail & Related papers (2024-06-24T07:17:59Z) - R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation [11.890598082534577]
Retrieval augmented generation (RAG) has been applied in many scenarios to augment large language models (LLMs) with external documents provided by retrievers.
This paper proposes R$2$AG, a novel enhanced RAG framework that incorporates Retrieval information into Retrieval Augmented Generation.
arXiv Detail & Related papers (2024-06-19T06:19:48Z) - SeRTS: Self-Rewarding Tree Search for Biomedical Retrieval-Augmented Generation [50.26966969163348]
Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented generation (RAG)
Existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for medical knowledge queries.
We propose Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm.
arXiv Detail & Related papers (2024-06-17T06:48:31Z) - PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z) - Corpus-Steered Query Expansion with Large Language Models [35.64662397095323]
We introduce Corpus-Steered Query Expansion (CSQE) to promote the incorporation of knowledge embedded within the corpus.
CSQE utilizes the relevance assessing capability of LLMs to systematically identify pivotal sentences in the initially-retrieved documents.
Extensive experiments reveal that CSQE exhibits strong performance without necessitating any training, especially with queries for which LLMs lack knowledge.
arXiv Detail & Related papers (2024-02-28T03:58:58Z) - Query Rewriting for Retrieval-Augmented Large Language Models [139.242907155883]
Large Language Models (LLMs) play powerful, black-box readers in the retrieve-then-read pipeline.
This work introduces a new framework, Rewrite-Retrieve-Read instead of the previous retrieve-then-read for the retrieval-augmented LLMs.
arXiv Detail & Related papers (2023-05-23T17:27:50Z) - The Web Can Be Your Oyster for Improving Large Language Models [98.72358969495835]
Large language models (LLMs) encode a large amount of world knowledge.
We consider augmenting LLMs with the large-scale web using search engine.
We present a web-augmented LLM UNIWEB, which is trained over 16 knowledge-intensive tasks in a unified text-to-text format.
arXiv Detail & Related papers (2023-05-18T14:20:32Z) - Synergistic Interplay between Search and Large Language Models for
Information Retrieval [141.18083677333848]
InteR allows RMs to expand knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z) - Query2doc: Query Expansion with Large Language Models [69.9707552694766]
The proposed method first generates pseudo- documents by few-shot prompting large language models (LLMs)
query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets.
Our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
arXiv Detail & Related papers (2023-03-14T07:27:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.