InPars: Data Augmentation for Information Retrieval using Large Language
Models
- URL: http://arxiv.org/abs/2202.05144v1
- Date: Thu, 10 Feb 2022 16:52:45 GMT
- Title: InPars: Data Augmentation for Information Retrieval using Large Language
Models
- Authors: Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Rodrigo Nogueira
- Abstract summary: In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for information retrieval tasks.
We show that models finetuned solely on our unsupervised dataset outperform strong baselines such as BM25.
Retrievers finetuned on both supervised and our synthetic data achieve better zero-shot transfer than models finetuned only on supervised data.
- Score: 5.851846467503597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The information retrieval community has recently witnessed a revolution due
to large pretrained transformer models. Another key ingredient for this
revolution was the MS MARCO dataset, whose scale and diversity have enabled
zero-shot transfer learning to various tasks. However, not all IR tasks and
domains can benefit from one single dataset equally. Extensive research in
various NLP tasks has shown that using domain-specific training data, as
opposed to a general-purpose one, improves the performance of neural models. In
this work, we harness the few-shot capabilities of large pretrained language
models as synthetic data generators for IR tasks. We show that models finetuned
solely on our unsupervised dataset outperform strong baselines such as BM25 as
well as recently proposed self-supervised dense retrieval methods. Furthermore,
retrievers finetuned on both supervised and our synthetic data achieve better
zero-shot transfer than models finetuned only on supervised data. Code, models,
and data are available at https://github.com/zetaalphavector/inpars .
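The core recipe described in the abstract, generating synthetic training queries for unlabeled documents by few-shot prompting a large language model, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's exact pipeline: the prompt template is a guess, and "gpt2" merely stands in for the much larger model the authors rely on; see the official repository above for the real implementation.

```python
# Sketch of few-shot synthetic query generation for IR, assuming a causal LM
# served through the Hugging Face transformers pipeline. The demonstration
# pairs and prompt wording are illustrative, not the paper's actual prompt.
from transformers import pipeline

FEW_SHOT_EXAMPLES = [
    ("The Amazon rainforest produces roughly 20% of the world's oxygen.",
     "how much oxygen does the amazon rainforest produce"),
    ("Insulin is a hormone that regulates blood glucose levels.",
     "what does insulin do in the body"),
]

def build_prompt(document: str) -> str:
    """Concatenate (document, query) demonstrations plus the new document."""
    parts = []
    for doc, query in FEW_SHOT_EXAMPLES:
        parts.append(f"Document: {doc}\nRelevant query: {query}\n")
    parts.append(f"Document: {document}\nRelevant query:")
    return "\n".join(parts)

def generate_synthetic_pairs(documents, model_name="gpt2"):
    """Return (synthetic_query, document) pairs for finetuning a retriever."""
    generator = pipeline("text-generation", model=model_name)
    pairs = []
    for doc in documents:
        out = generator(
            build_prompt(doc),
            max_new_tokens=32,
            do_sample=False,
            return_full_text=False,  # keep only the generated continuation
        )
        # Take the first non-empty line of the continuation as the query.
        lines = out[0]["generated_text"].strip().splitlines()
        query = lines[0].strip() if lines else ""
        pairs.append((query, doc))
    return pairs

if __name__ == "__main__":
    corpus = ["BM25 is a ranking function used by search engines to score documents."]
    for query, doc in generate_synthetic_pairs(corpus):
        print(query, "->", doc[:60])
```

The resulting (query, document) pairs would then serve as positive training examples for finetuning a neural retriever or reranker, with negatives drawn, for instance, from BM25 results over the same corpus.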
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
- A synthetic data approach for domain generalization of NLI models [13.840374911669167]
Natural Language Inference (NLI) remains an important benchmark task for LLMs.
We show that synthetic high-quality datasets can adapt NLI models for zero-shot use in downstream applications.
We show that models trained on this data have the best generalization to completely new downstream test settings.
arXiv Detail & Related papers (2024-02-19T18:55:16Z)
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models [15.991777903345575]
Large language models can generalize to novel downstream tasks with relatively few labeled examples.
Alternatively, smaller models can solve specific tasks if fine-tuned with enough labeled examples.
We study synthetic data generation of fine-tuning training data via fine-tuned teacher LLMs to improve the downstream performance of much smaller models.
arXiv Detail & Related papers (2023-10-02T11:49:05Z)
- Reproducible, incremental representation learning with Rosetta VAE [0.0]
Variational autoencoders are among the most popular methods for distilling low-dimensional structure from high-dimensional data.
We introduce the Rosetta VAE, a method of distilling previously learned representations and retraining new models to reproduce and build on prior results.
We demonstrate that the R-VAE reconstructs data as well as the VAE and $\beta$-VAE, and outperforms both methods in recovering a target latent space in a sequential training setting.
arXiv Detail & Related papers (2022-01-13T20:45:35Z)
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves strong performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, used for QA datasets like QAMR and SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Parameter-Efficient Transfer from Sequential Behaviors for User Modeling and Recommendation [111.44445634272235]
In this paper, we develop a parameter-efficient transfer learning architecture termed PeterRec.
PeterRec allows the pre-trained parameters to remain unaltered during fine-tuning by injecting a series of re-learned neural networks.
We perform extensive experimental ablation to show the effectiveness of the learned user representation in five downstream tasks.
arXiv Detail & Related papers (2020-01-13T14:09:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.