Large Language Models are Built-in Autoregressive Search Engines
- URL: http://arxiv.org/abs/2305.09612v1
- Date: Tue, 16 May 2023 17:04:48 GMT
- Title: Large Language Models are Built-in Autoregressive Search Engines
- Authors: Noah Ziems, Wenhao Yu, Zhihan Zhang, Meng Jiang
- Abstract summary: Large language models (LLMs) can follow human instructions to directly generate URLs for document retrieval.
LLMs can generate Web URLs where nearly 90% of the corresponding documents contain correct answers to open-domain questions.
- Score: 19.928494069013485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document retrieval is a key stage of standard Web search engines. Existing
dual-encoder dense retrievers obtain representations for questions and
documents independently, allowing for only shallow interactions between them.
To overcome this limitation, recent autoregressive search engines replace the
dual-encoder architecture by directly generating identifiers for relevant
documents in the candidate pool. However, the training cost of such
autoregressive search engines rises sharply as the number of candidate
documents increases. In this paper, we find that large language models (LLMs)
can follow human instructions to directly generate URLs for document retrieval.
Surprisingly, when providing a few Query-URL pairs as in-context
demonstrations, LLMs can generate Web URLs where nearly 90% of the
corresponding documents contain correct answers to open-domain questions. In
this way, LLMs can be thought of as built-in search engines, since they have
not been explicitly trained to map questions to document identifiers.
Experiments demonstrate that our method can consistently achieve better
retrieval performance than existing retrieval approaches by a significant
margin on three open-domain question answering benchmarks, under both zero-
and few-shot settings. The code for this work can be found at
https://github.com/Ziems/llm-url.
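The abstract's central recipe, showing the model a handful of Query-URL demonstrations and letting it decode a URL for a new question, can be sketched in a few lines. The code below is a minimal illustration under stated assumptions (an OpenAI-style chat-completions client; made-up demonstration pairs, prompt wording, and model name), not the authors' released implementation; see the repository linked above for that.

```python
# Minimal sketch of in-context URL generation for retrieval (not the authors' exact code;
# see https://github.com/Ziems/llm-url for the real implementation).
# Assumes an OpenAI-style chat-completions client; the demonstrations, prompt wording,
# and model name below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A few Query-URL pairs shown in-context, as described in the abstract.
DEMONSTRATIONS = [
    ("who wrote the novel moby dick", "https://en.wikipedia.org/wiki/Moby-Dick"),
    ("what is the capital of australia", "https://en.wikipedia.org/wiki/Canberra"),
]

def generate_urls(question: str, n_samples: int = 3) -> list[str]:
    """Prompt the LLM to emit candidate Web URLs whose pages should answer the question."""
    demos = "\n\n".join(f"Question: {q}\nURL: {u}" for q, u in DEMONSTRATIONS)
    prompt = (
        "Generate a Web URL whose page contains the answer to the question.\n\n"
        f"{demos}\n\nQuestion: {question}\nURL:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        n=n_samples,      # sample several candidate URLs
        temperature=0.7,
    )
    # Keep only outputs that look like URLs; a downstream reader would fetch these
    # pages and extract passages that answer the question.
    urls = [choice.message.content.strip() for choice in response.choices]
    return [u for u in urls if u.startswith("http")]

if __name__ == "__main__":
    print(generate_urls("when was the eiffel tower completed"))
```

The generated pages would then be fetched and passed to a reader, mirroring the retrieve-then-read pipelines the paper compares against.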
Related papers
- PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z)
- Harnessing Multi-Role Capabilities of Large Language Models for Open-Domain Question Answering [40.2758450304531]
Open-domain question answering (ODQA) has emerged as a pivotal research focus in information systems.
We propose a framework that formulates the ODQA process into three basic steps: query expansion, document selection, and answer generation.
We introduce a novel prompt optimization algorithm to refine role-playing prompts and steer LLMs to produce higher-quality evidence and answers.
arXiv Detail & Related papers (2024-03-08T11:09:13Z)
- Generator-Retriever-Generator Approach for Open-Domain Question Answering [18.950517545413813]
We propose a novel approach that combines document retrieval techniques with a large language model (LLM).
In parallel, a dual-encoder network retrieves documents that are relevant to the question from an external corpus.
GRG outperforms the state-of-the-art generate-then-read and retrieve-then-read pipelines.
arXiv Detail & Related papers (2023-07-21T00:34:38Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
Analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the majority of errors (53.5%) are due to missing document context.
The benchmark we create enables future research on developing and comparing retrieval systems for this new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- Query2doc: Query Expansion with Large Language Models [69.9707552694766]
The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs). A rough sketch of this expansion step appears after this list.
query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets.
Our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
arXiv Detail & Related papers (2023-03-14T07:27:30Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer. A rough sketch of this pipeline appears after this list.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
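For the Query2doc entry above, the described expansion can be read as a two-step pipeline: prompt an LLM for a pseudo-document that answers the query, then concatenate it with the (repeated) original query before BM25 scoring. The sketch below illustrates that reading; the `generate` callable, the repetition factor, and the rank_bm25 retriever are assumptions, not the paper's exact setup.

```python
# Illustrative Query2doc-style expansion: LLM pseudo-document + BM25 (not the paper's code).
# `generate` is any callable that maps a prompt string to generated text (assumption).
from rank_bm25 import BM25Okapi

def expand_query(query: str, generate, repeat: int = 5) -> str:
    # Step 1: ask the LLM for a short passage that plausibly answers the query.
    pseudo_doc = generate(f"Write a passage that answers the following question: {query}")
    # Step 2: repeat the original query so its terms are not drowned out by the longer passage.
    return " ".join([query] * repeat + [pseudo_doc])

def bm25_top_k(corpus: list[str], expanded_query: str, k: int = 10) -> list[str]:
    # Score the expanded query against a tokenized corpus with standard BM25.
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25.get_top_n(expanded_query.lower().split(), corpus, n=k)
```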
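The generate-then-read (GenRead) entry likewise reduces to two prompts: one asking the LLM to write background documents for a question, and one answering the question conditioned on those generated documents. A minimal sketch under that reading follows; the prompt strings and the `generate` callable are illustrative assumptions rather than the authors' released code.

```python
# Minimal generate-then-read sketch (GenRead-style); prompts are illustrative assumptions.
# `generate` is any callable mapping a prompt string to generated text.
def generate_then_read(question: str, generate, n_docs: int = 3) -> str:
    # Step 1: prompt the LLM to write background documents about the question.
    docs = [
        generate(f"Generate a background document to answer the question: {question}")
        for _ in range(n_docs)
    ]
    # Step 2: read the generated documents and produce the final answer.
    context = "\n\n".join(docs)
    return generate(
        "Refer to the passages below and answer the following question.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```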
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.