SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval
- URL: http://arxiv.org/abs/2408.16312v3
- Date: Sat, 25 Jan 2025 16:30:57 GMT
- Title: SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval
- Authors: Hossein A. Rahmani, Xi Wang, Emine Yilmaz, Nick Craswell, Bhaskar Mitra, Paul Thomas
- Abstract summary: We extend the TREC Deep Learning Track (DL) test collection with additional synthetic relevance labels generated by large language models, enabling researchers to test and evaluate their search systems at a large scale.
Specifically, the resulting test collection includes more than 1,900 test queries from previous years of the track.
We compare system evaluation under the synthetic labels with human labels from past years and find that our synthetically created large-scale test collection leads to highly correlated system rankings.
- Score: 30.269970599368815
- Abstract: Large-scale test collections play a crucial role in Information Retrieval (IR) research. However, under the Cranfield paradigm and based on research into publicly available datasets, existing IR studies are commonly developed on small-scale datasets that rely on human assessors for relevance judgments, a time-intensive and expensive process. Recent studies have shown that Large Language Models (LLMs) can produce reliable relevance judgments with human-level accuracy at a greatly reduced cost. In this paper, to address the lack of a large-scale ad-hoc document retrieval dataset, we extend the TREC Deep Learning Track (DL) test collection with additional synthetic relevance labels generated by language models, enabling researchers to test and evaluate their search systems at a large scale. Specifically, the resulting test collection includes more than 1,900 test queries from previous years of the track. We compare system evaluation under the synthetic labels with human labels from past years and find that our synthetically created large-scale test collection leads to highly correlated system rankings.
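As a concrete (hypothetical) illustration of the ranking comparison described in the abstract, the sketch below computes Kendall's tau between the system orderings induced by human and synthetic qrels; the run names and NDCG values are placeholders, not SynDL results.

```python
# Minimal sketch: compare system rankings under human vs. synthetic (LLM) qrels.
# Per-system effectiveness scores below are illustrative placeholders.
from scipy.stats import kendalltau

human_ndcg = {"runA": 0.52, "runB": 0.47, "runC": 0.61, "runD": 0.39}
synthetic_ndcg = {"runA": 0.55, "runB": 0.44, "runC": 0.63, "runD": 0.41}

systems = sorted(human_ndcg)                      # fixed system order
human_scores = [human_ndcg[s] for s in systems]
synthetic_scores = [synthetic_ndcg[s] for s in systems]

tau, p_value = kendalltau(human_scores, synthetic_scores)
print(f"Kendall's tau between system rankings: {tau:.3f} (p = {p_value:.3f})")
```

A tau close to 1 would indicate that evaluating against the synthetic labels orders systems almost identically to evaluating against human labels.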
Related papers
- GenTREC: The First Test Collection Generated by Large Language Models for Evaluating Information Retrieval Systems [0.33748750222488655]
GenTREC is the first test collection constructed entirely from documents generated by a Large Language Model (LLM).
We consider a document relevant only to the prompt that generated it, while other document-topic pairs are treated as non-relevant.
The resulting GenTREC collection comprises 96,196 documents, 300 topics, and 18,964 relevance "judgments".
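The labelling rule above can be sketched in a few lines; the topic and document identifiers below are hypothetical and only illustrate the relevant-to-generating-prompt convention.

```python
# Illustrative sketch of the GenTREC labelling rule: each generated document is
# judged relevant (1) only for the topic whose prompt produced it; every other
# (topic, document) pair defaults to non-relevant (0). IDs are hypothetical.
generated = {
    "T1": ["D101", "D102"],   # documents generated from topic T1's prompt
    "T2": ["D201"],
}

all_docs = [doc for docs in generated.values() for doc in docs]
qrels = {
    topic: {doc: int(doc in docs) for doc in all_docs}
    for topic, docs in generated.items()
}
print(qrels["T1"])  # {'D101': 1, 'D102': 1, 'D201': 0}
```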
arXiv Detail & Related papers (2025-01-05T00:27:36Z)
- On the Statistical Significance with Relevance Assessments of Large Language Models [2.9180406633632523]
We use Large Language Models to label the relevance of documents when building new retrieval test collections.
Our results show that LLM judgements detect most of the significant differences while maintaining acceptable numbers of false positives.
Our work represents a step forward in the evaluation of statistical testing results provided by LLM judgements.
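A minimal sketch of that kind of comparison, assuming per-query scores for two hypothetical runs are available under both judgment sources: run a paired t-test under each source and check whether the significance decisions agree.

```python
# Illustrative sketch with hypothetical per-query scores: test whether two runs
# differ significantly under human qrels and under LLM qrels, then compare the
# decisions reached by the two judgment sources.
from scipy.stats import ttest_rel

human = {"runA": [0.4, 0.6, 0.5, 0.7, 0.3], "runB": [0.3, 0.5, 0.5, 0.6, 0.2]}
llm   = {"runA": [0.5, 0.6, 0.4, 0.7, 0.3], "runB": [0.4, 0.5, 0.4, 0.6, 0.3]}

for source, scores in [("human", human), ("LLM", llm)]:
    _, p = ttest_rel(scores["runA"], scores["runB"])
    print(f"{source} qrels: p = {p:.3f} -> significant at 0.05: {p < 0.05}")
```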
arXiv Detail & Related papers (2024-11-20T11:19:35Z)
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Synthetic Test Collections for Retrieval Evaluation [31.36035082257619]
Test collections play a vital role in evaluation of information retrieval (IR) systems.
We investigate whether it is possible to use Large Language Models (LLMs) to construct synthetic test collections.
Our experiments indicate that, using LLMs, it is possible to construct synthetic test collections that can be reliably used for retrieval evaluation.
arXiv Detail & Related papers (2024-05-13T14:11:09Z)
- Scalable Learning of Item Response Theory Models [48.91265296134559]
Item Response Theory (IRT) models aim to assess latent abilities of $n$ examinees along with latent difficulty characteristics of $m$ test items from categorical data.
We leverage the similarity of these models to logistic regression, which can be approximated accurately using small weighted subsets called coresets.
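To make the logistic-regression connection mentioned above concrete, the standard two-parameter logistic (2PL) IRT model can be written as follows (the notation is illustrative, not necessarily the paper's):

\[
  \Pr(X_{ij} = 1 \mid \theta_i) = \sigma\big(a_j(\theta_i - b_j)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},
\]

where $\theta_i$ is the latent ability of examinee $i$ and $a_j$, $b_j$ are the discrimination and difficulty of item $j$; with $a_j = 1$ (the Rasch/1PL case) this corresponds to a logistic regression with indicator features for examinees and items, which is the similarity the coreset approximation exploits.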
arXiv Detail & Related papers (2024-03-01T17:12:53Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- A large dataset curation and benchmark for drug target interaction [0.7699646945563469]
Bioactivity data plays a key role in drug discovery and repurposing.
We propose a way to standardize and efficiently represent a very large dataset curated from multiple public sources.
arXiv Detail & Related papers (2024-01-30T17:06:25Z)
- CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews [10.207938863784829]
We introduce CSMeD, a meta-dataset consolidating nine publicly released collections.
CSMeD serves as a comprehensive resource for training and evaluating the performance of automated citation screening models.
We introduce CSMeD-FT, a new dataset designed explicitly for evaluating the full text publication screening task.
arXiv Detail & Related papers (2023-11-21T09:36:11Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
- Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction [61.48964753725744]
We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that the manual evaluation can indicate very different conclusions from automatic ones.
arXiv Detail & Related papers (2021-05-20T06:55:40Z)