Related papers: Synthetic Test Collections for Retrieval Evaluation

Synthetic Test Collections for Retrieval Evaluation

URL: http://arxiv.org/abs/2405.07767v1
Date: Mon, 13 May 2024 14:11:09 GMT
Title: Synthetic Test Collections for Retrieval Evaluation
Authors: Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, Daniel Campos,
Abstract summary: Test collections play a vital role in evaluation of information retrieval (IR) systems. We investigate whether it is possible to use Large Language Models (LLMs) to construct synthetic test collections. Our experiments indicate that using LLMs it is possible to construct synthetic test collections that can reliably be used for retrieval evaluation.
Score: 31.36035082257619
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Test collections play a vital role in evaluation of information retrieval (IR) systems. Obtaining a diverse set of user queries for test collection construction can be challenging, and acquiring relevance judgments, which indicate the appropriateness of retrieved documents to a query, is often costly and resource-intensive. Generating synthetic datasets using Large Language Models (LLMs) has recently gained significant attention in various applications. In IR, while previous work exploited the capabilities of LLMs to generate synthetic queries or documents to augment training data and improve the performance of ranking models, using LLMs for constructing synthetic test collections is relatively unexplored. Previous studies demonstrate that LLMs have the potential to generate synthetic relevance judgments for use in the evaluation of IR systems. In this paper, we comprehensively investigate whether it is possible to use LLMs to construct fully synthetic test collections by generating not only synthetic judgments but also synthetic queries. In particular, we analyse whether it is possible to construct reliable synthetic test collections and the potential risks of bias such test collections may exhibit towards LLM-based models. Our experiments indicate that using LLMs it is possible to construct synthetic test collections that can reliably be used for retrieval evaluation.

Related papers

Automated Test Suite Enhancement Using Large Language Models with Few-shot Prompting [0.0]
Unit testing is essential for verifying the functional correctness of code modules.<n>Unit tests generated by tools that employ traditional approaches, such as search-based software testing (SBST), lack readability, naturalness, and practical usability.<n>Software repositories now include a mix of human-written tests, LLM-generated tests, and those from tools employing traditional approaches such as SBST.
arXiv Detail & Related papers (2026-02-12T18:42:49Z)
Towards Understanding Bias in Synthetic Data for Evaluation [26.50462114230235]
We investigate the reliability of synthetic test collections constructed using Large Language Models (LLMs)<n>We first empirically show the presence of such bias in evaluation results and analyse the effects it might have on system evaluation.<n>Our analysis shows that while the effect of bias present in evaluation results obtained using synthetic test collections could be significant, for e.g.computing absolute system performance, its effect may not be as significant in comparing relative system performance.
arXiv Detail & Related papers (2025-06-12T02:25:42Z)
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
REANIMATOR: Reanimate Retrieval Test Collections with Extracted and Synthetic Resources [1.1309478649967237]
We introduce REANIMATOR, a versatile framework designed to enable the repurposing of existing test collections. It enhances test collections from PDF files by parsing full texts and machine-readable tables. It then employs state-of-the-art large language models to produce synthetic relevance labels.
arXiv Detail & Related papers (2025-04-10T09:25:11Z)
Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling [69.84963245729826]
We propose an auxiliary task of QL to enhance the backbone for subsequent contrastive learning of the retriever.<n>We introduce our model, which incorporates two key components: Attention Block (AB) and Document Corruption (DC)
arXiv Detail & Related papers (2025-04-07T16:03:59Z)
Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning [59.25951947621526]
We propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+), and analyzed synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.
arXiv Detail & Related papers (2025-02-19T15:32:11Z)
A Reproducibility and Generalizability Study of Large Language Models for Query Generation [14.172158182496295]
generative AI and large language models (LLMs) promise to revolutionize the systematic literature review process. This paper presents an extensive study of Boolean query generation using LLMs for systematic reviews. Our study investigates the replicability and reliability of results achieved using ChatGPT. We then generalize our results by analyzing and evaluating open-source models.
arXiv Detail & Related papers (2024-11-22T13:15:03Z)
Data Fusion of Synthetic Query Variants With Generative Large Language Models [1.864807003137943]
This work explores the feasibility of using synthetic query variants generated by instruction-tuned Large Language Models in data fusion experiments. We introduce a lightweight, unsupervised, and cost-efficient approach that exploits principled prompting and data fusion techniques. Our analysis shows that data fusion based on synthetic query variants is significantly better than baselines with single queries and also outperforms pseudo-relevance feedback methods.
arXiv Detail & Related papers (2024-11-06T12:54:27Z)
Unleashing the Power of Large Language Models in Zero-shot Relation Extraction via Self-Prompting [21.04933334040135]
We introduce the Self-Prompting framework, a novel method designed to fully harness the embedded RE knowledge within Large Language Models. Our framework employs a three-stage diversity approach to prompt LLMs, generating multiple synthetic samples that encapsulate specific relations from scratch. Experimental evaluations on benchmark datasets show our approach outperforms existing LLM-based zero-shot RE methods.
arXiv Detail & Related papers (2024-10-02T01:12:54Z)
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation [26.158785168036662]
Retrieval-Augmented Generation allows to enhance Large Language Models with external knowledge. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments.
arXiv Detail & Related papers (2024-07-01T09:09:27Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
Automatic benchmarking of large multimodal models via iterative experiment programming [71.78089106671581]
We present APEx, the first framework for automatic benchmarking of LMMs. Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand. The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions.
arXiv Detail & Related papers (2024-06-18T06:43:46Z)
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STARK, a large-scale Semi-structure retrieval benchmark on Textual and Knowledge Bases. Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z)
Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal [49.24054920683246]
Large language models (LLMs) suffer from catastrophic forgetting during continual learning. We propose a framework called Self-Synthesized Rehearsal (SSR) that uses the LLM to generate synthetic instances for rehearsal.
arXiv Detail & Related papers (2024-03-02T16:11:23Z)
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.