InPars-v2: Large Language Models as Efficient Dataset Generators for
Information Retrieval
- URL: http://arxiv.org/abs/2301.01820v4
- Date: Fri, 26 May 2023 22:16:38 GMT
- Title: InPars-v2: Large Language Models as Efficient Dataset Generators for
Information Retrieval
- Authors: Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto
Lotufo, Jakub Zavrel, Rodrigo Nogueira
- Abstract summary: We introduce InPars-v2, a dataset generator that uses open-source LLMs and powerful rerankers to select synthetic query-document pairs for training.
A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark.
- Score: 4.888022358881737
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, InPars introduced a method to efficiently use large language models
(LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced
to generate relevant queries for documents. These synthetic query-document
pairs can then be used to train a retriever. However, InPars and, more
recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to
generate such datasets. In this work we introduce InPars-v2, a dataset
generator that uses open-source LLMs and existing powerful rerankers to select
synthetic query-document pairs for training. A simple BM25 retrieval pipeline
followed by a monoT5 reranker finetuned on InPars-v2 data achieves new
state-of-the-art results on the BEIR benchmark. To allow researchers to further
improve our method, we open source the code, synthetic data, and finetuned
models: https://github.com/zetaalphavector/inPars/tree/master/tpu
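As a rough illustration of the pipeline described above, the sketch below few-shot prompts an open-source LLM to generate a query for a document and then scores the synthetic pair with a monoT5 reranker, keeping only high-scoring pairs for training. This is a minimal sketch, not the authors' released code: the GPT-J and monoT5-3B checkpoints, the prompt wording, and the helper names are assumptions.

```python
# Minimal InPars-v2-style sketch: few-shot query generation + reranker filtering.
# Checkpoints, prompt wording, and helper names are illustrative assumptions.
import torch
from transformers import pipeline, T5ForConditionalGeneration, T5Tokenizer

generator = pipeline("text-generation", model="EleutherAI/gpt-j-6b")  # any open-source LLM

FEW_SHOT = (
    "Document: {ex_doc}\nRelevant query: {ex_query}\n\n"
    "Document: {doc}\nRelevant query:"
)

def generate_query(doc: str, ex_doc: str, ex_query: str) -> str:
    # Few-shot prompt the LLM and keep only the first generated line as the query.
    prompt = FEW_SHOT.format(ex_doc=ex_doc, ex_query=ex_query, doc=doc)
    out = generator(prompt, max_new_tokens=32, do_sample=False,
                    return_full_text=False)[0]["generated_text"]
    return out.split("\n")[0].strip()

# Filtering: score each synthetic (query, document) pair with monoT5 and keep the top pairs.
reranker = T5ForConditionalGeneration.from_pretrained("castorini/monot5-3b-msmarco-10k")
rr_tok = T5Tokenizer.from_pretrained("castorini/monot5-3b-msmarco-10k")

def monot5_score(query: str, doc: str) -> float:
    enc = rr_tok(f"Query: {query} Document: {doc} Relevant:",
                 return_tensors="pt", truncation=True, max_length=512)
    decoder_start = torch.tensor([[reranker.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = reranker(**enc, decoder_input_ids=decoder_start).logits[0, 0]
    # monoT5 frames relevance as generating "true" vs. "false" at the first decoding step.
    true_id, false_id = rr_tok.encode("true")[0], rr_tok.encode("false")[0]
    return torch.softmax(logits[[false_id, true_id]], dim=0)[1].item()
```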
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves retrieval performance competitive with state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
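A minimal sketch of how subgraph retrieval can be framed as conditional generation with a roughly 220M-parameter seq2seq model; the checkpoint, prompt prefix, and relation-path output format below are assumptions for illustration, not the paper's exact setup.

```python
# Illustration only: a small seq2seq model generates a subgraph description (here a
# relation path) conditioned on the question, after fine-tuning on (question, path) pairs.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")                     # ~220M-parameter model
model = T5ForConditionalGeneration.from_pretrained("t5-base")

question = "Which university did the author of Dune attend?"
inputs = tok("generate subgraph: " + question, return_tensors="pt")
path_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
relation_path = tok.decode(path_ids[0], skip_special_tokens=True)
# After fine-tuning, the decoded string would name the KG relations to expand,
# e.g. "book.author -> person.education -> education.institution".
```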
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
- PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
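A hedged sketch of how a single LLM forward pass can yield both representations, in the spirit of PromptReps; the prompt wording, checkpoint, and top-k sparsification below are assumptions rather than the paper's exact recipe.

```python
# Sketch: prompt the LLM to compress a passage into one word, then reuse the last-token
# hidden state as a dense embedding and the next-token logits as a sparse bag of words.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.2"        # any instruction-tuned LLM (assumption)
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)

def represent(text: str):
    prompt = f'Passage: "{text}"\nRepresent the passage above with a single word:'
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm(**enc, output_hidden_states=True)
    dense = out.hidden_states[-1][0, -1]           # last-token hidden state -> dense vector
    logits = out.logits[0, -1]                     # next-token logits over the vocabulary
    values, indices = torch.relu(logits).topk(128) # keep top activations as sparse weights
    return dense, dict(zip(indices.tolist(), values.tolist()))
```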
arXiv Detail & Related papers (2024-04-29T04:51:30Z)
- Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval [56.65147231836708]
We develop SWIM-IR, a synthetic retrieval training dataset containing 33 languages for fine-tuning multilingual dense retrievers.
SAP (summarize-then-ask prompting) assists the large language model (LLM) in generating informative queries in the target language.
Our models, called SWIM-X, are competitive with human-supervised dense retrieval models.
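An illustrative summarize-then-ask style template for producing a query in a target language from an English passage; the exact SWIM-IR prompts are not reproduced here, and the wording below is an assumption.

```python
# Illustration only: a two-step prompt that first summarizes the passage, then asks for a
# target-language query grounded in that summary.
SAP_TEMPLATE = (
    "Passage: {passage}\n"
    "Step 1: Summarize the passage in one or two sentences.\n"
    "Step 2: Based on the summary, write a natural search query in {language} "
    "that the passage answers.\n"
)
prompt = SAP_TEMPLATE.format(passage="...", language="Bengali")
```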
arXiv Detail & Related papers (2023-11-10T00:17:10Z)
- SEED: Domain-Specific Data Curation With Large Language Models [22.54280367957015]
We present SEED, an LLM-as-compiler approach that automatically generates domain-specific data curation solutions via Large Language Models (LLMs).
SEED automatically selects from the four LLM-assisted modules and forms a hybrid execution pipeline that best fits the task at hand.
arXiv Detail & Related papers (2023-10-01T17:59:20Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data for each given data type.
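One simple, hypothetical way to realize such adaptive bad-case sampling is to weight each data type by the model's current error rate on it, as in the sketch below; the module's actual weighting scheme may differ.

```python
# Illustration only: sample more of the data types the model currently fails on.
import random

def sampling_weights(errors: dict, totals: dict) -> dict:
    # Error rate per data type, normalized into sampling probabilities.
    rates = {t: errors.get(t, 0) / max(totals[t], 1) for t in totals}
    norm = sum(rates.values()) or 1.0
    return {t: r / norm for t, r in rates.items()}

weights = sampling_weights({"counting": 40, "ocr": 10}, {"counting": 100, "ocr": 100})
next_type = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
# Data of `next_type` is then generated (e.g. with GPT-4) and added to the next training round.
```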
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- Generator-Retriever-Generator Approach for Open-Domain Question Answering [18.950517545413813]
We propose a novel approach that combines document retrieval techniques with a large language model (LLM): the LLM is prompted to generate contextual documents for a given question, while, in parallel, a dual-encoder network retrieves documents that are relevant to the question from an external corpus.
GRG outperforms the state-of-the-art generate-then-read and retrieve-then-read pipelines.
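A minimal sketch of the generate-retrieve-generate flow, assuming a hypothetical `llm(prompt)` completion helper and a sentence-transformers dual encoder; the concrete models, prompts, and fusion strategy are illustrative, not the paper's exact setup.

```python
# Sketch: an LLM drafts context documents, a dual encoder retrieves real ones, and a
# reader LLM answers conditioned on both document sets.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")   # any dual-encoder checkpoint
corpus = ["...passages from an external corpus..."]
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

def grg_answer(question: str, llm) -> str:
    # 1) Generator: the LLM drafts candidate context documents for the question.
    generated = [llm(f"Write a short passage that answers: {question}")]
    # 2) Retriever: the dual encoder pulls relevant passages from the external corpus.
    q_emb = encoder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=3)[0]
    retrieved = [corpus[h["corpus_id"]] for h in hits]
    # 3) Generator again: a reader LLM answers conditioned on both document sets.
    context = "\n\n".join(generated + retrieved)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```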
arXiv Detail & Related papers (2023-07-21T00:34:38Z)
- Large Language Models are Strong Zero-Shot Retriever [89.16756291653371]
We propose a simple method that applies a large language model (LLM) to large-scale retrieval in zero-shot scenarios.
Our method, the language model as retriever (LameR), is built upon no neural models other than an LLM.
arXiv Detail & Related papers (2023-04-27T14:45:55Z)
- Query2doc: Query Expansion with Large Language Models [69.9707552694766]
The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs).
query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets.
Our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
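A hedged sketch of this kind of expansion for BM25: an LLM writes a pseudo-document for the query, which is appended to the repeated original query. The `llm` callable, prompt, and repetition factor below are assumptions rather than the paper's exact recipe.

```python
# Sketch of query expansion with an LLM-generated pseudo-document for sparse retrieval.
def expand_query(query: str, llm, repeat: int = 5) -> str:
    # `llm` is a hypothetical few-shot-prompted completion function.
    pseudo_doc = llm(f"Write a passage that answers the following query.\nQuery: {query}\nPassage:")
    # Repeat the original query so its terms are not drowned out by the pseudo-document.
    return " ".join([query] * repeat) + " " + pseudo_doc

# Usage (with any LLM callable):
#   expanded = expand_query("what causes tidal locking", llm=my_llm)
#   feed `expanded` to BM25 instead of the raw query; dense retrievers can instead encode
#   the query and pseudo-document concatenated as a single text.
```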
arXiv Detail & Related papers (2023-03-14T07:27:30Z)
- Evaluating the Impact of Source Code Parsers on ML4SE Models [3.699097874146491]
We evaluate two models, namely Supernorm2Seq and TreeLSTM, on the name prediction task.
We show that trees built by different parsers vary in their structure and content.
We then analyze how this diversity affects the models' quality.
arXiv Detail & Related papers (2022-06-17T12:10:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.