REANIMATOR: Reanimate Retrieval Test Collections with Extracted and Synthetic Resources
- URL: http://arxiv.org/abs/2504.07584v1
- Date: Thu, 10 Apr 2025 09:25:11 GMT
- Title: REANIMATOR: Reanimate Retrieval Test Collections with Extracted and Synthetic Resources
- Authors: Björn Engelmann, Fabian Haak, Philipp Schaer, Mani Erfanian Abdoust, Linus Netze, Meik Bittkowski
- Abstract summary: We introduce REANIMATOR, a versatile framework designed to enable the repurposing of existing test collections. It enhances test collections from PDF files by parsing full texts and machine-readable tables. It then employs state-of-the-art large language models to produce synthetic relevance labels.
- Score: 1.1309478649967237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval test collections are essential for evaluating information retrieval systems, yet they often lack generalizability across tasks. To overcome this limitation, we introduce REANIMATOR, a versatile framework designed to enable the repurposing of existing test collections by enriching them with extracted and synthetic resources. REANIMATOR enhances test collections from PDF files by parsing full texts and machine-readable tables, as well as related contextual information. It then employs state-of-the-art large language models to produce synthetic relevance labels. Including an optional human-in-the-loop step can help validate the resources that have been extracted and generated. We demonstrate its potential with a revitalized version of the TREC-COVID test collection, showcasing the development of a retrieval-augmented generation system and evaluating the impact of tables on retrieval-augmented generation. REANIMATOR enables the reuse of test collections for new applications, lowering costs and broadening the utility of legacy resources.
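The pipeline described in the abstract can be pictured as two stages: extract full text and tables from each PDF, then prompt an LLM for a synthetic relevance label, with optional human review. A minimal sketch, assuming pdfplumber for PDF parsing and a generic llm() text-completion callable; it is not the authors' implementation.
```python
# Hedged sketch of the extract-then-label idea from the abstract.
# pdfplumber and the llm() helper are illustrative assumptions.
import pdfplumber

def extract_pdf(path: str) -> dict:
    """Extract full text and machine-readable tables from one PDF."""
    text_parts, tables = [], []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text_parts.append(page.extract_text() or "")
            tables.extend(page.extract_tables())
    return {"full_text": "\n".join(text_parts), "tables": tables}

def synthetic_relevance(llm, topic: str, doc: dict) -> int:
    """Ask an LLM judge for a graded relevance label (0-2)."""
    prompt = (
        f"Topic: {topic}\n\nDocument:\n{doc['full_text'][:4000]}\n\n"
        "Rate the document's relevance to the topic from 0 (not relevant) "
        "to 2 (highly relevant). Answer with a single digit."
    )
    answer = llm(prompt)           # any text-completion callable
    return int(answer.strip()[0])  # optionally validated by a human reviewer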
Related papers
- Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search [65.53881294642451]
The Deliberate Thinking based Dense Retriever (DEBATER) enhances recent dense retrievers by enabling them to learn more effective document representations through a step-by-step thinking process.
Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks.
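A rough illustration of the "thinking before encoding" idea as the summary describes it: generate a short step-by-step reflection on a document and embed document plus reflection. DEBATER's actual architecture and training are not reproduced here; the encoder choice and prompt are assumptions.
```python
# Illustrative only: reflect on the document with an LLM, then embed both.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

def encode_with_thinking(llm, document: str):
    thought = llm(
        "Think step by step: what information does this document provide, "
        f"and which queries could it answer?\n\n{document}"
    )
    # Concatenate the reflection with the document before embedding.
    return encoder.encode(document + "\n" + thought)
```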
arXiv Detail & Related papers (2025-02-18T15:56:34Z) - GenTREC: The First Test Collection Generated by Large Language Models for Evaluating Information Retrieval Systems [0.33748750222488655]
GenTREC is the first test collection constructed entirely from documents generated by a Large Language Model (LLM).
We consider a document relevant only to the prompt that generated it, while other document-topic pairs are treated as non-relevant.
The resulting GenTREC collection comprises 96,196 documents, 300 topics, and 18,964 relevance "judgments".
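The labeling rule above is simple enough to sketch directly: a document counts as relevant only to the prompt (topic) that generated it, and every other pair defaults to non-relevant. Field names are illustrative.
```python
# Sketch of the GenTREC relevance rule described above.
def build_qrels(generated_docs):
    """generated_docs: list of dicts with 'doc_id' and 'topic_id' keys."""
    qrels = {}
    for doc in generated_docs:
        # Relevance 1 for the generating topic; all other pairs default to 0.
        qrels[(doc["topic_id"], doc["doc_id"])] = 1
    return qrels
```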
arXiv Detail & Related papers (2025-01-05T00:27:36Z) - A Reproducibility and Generalizability Study of Large Language Models for Query Generation [14.172158182496295]
Generative AI and large language models (LLMs) promise to revolutionize the systematic literature review process.
This paper presents an extensive study of Boolean query generation using LLMs for systematic reviews.
Our study investigates the replicability and reliability of results achieved using ChatGPT.
We then generalize our results by analyzing and evaluating open-source models.
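A hedged sketch of what Boolean query generation for a systematic review can look like as a single prompt to a generic llm() callable; the prompt wording is an assumption, not the exact prompts studied in the paper.
```python
# Illustrative Boolean query generation prompt.
def generate_boolean_query(llm, review_title: str, objective: str) -> str:
    prompt = (
        "You are an information specialist. Construct a Boolean query "
        "(using AND, OR, and parentheses) for a systematic review.\n"
        f"Title: {review_title}\nObjective: {objective}\n"
        "Return only the query."
    )
    return llm(prompt).strip()
```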
arXiv Detail & Related papers (2024-11-22T13:15:03Z) - Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models [25.301280441283147]
This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance.
We develop a novel retrieval evaluation benchmark spanning six document-level attributes.
Our findings indicate that although fine-tuning models on instruction-aware retrieval datasets enhances performance, most models still fall short of instruction compliance.
arXiv Detail & Related papers (2024-10-31T11:47:21Z) - ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models [58.34560740973768]
We introduce a framework that leverages language models (LMs) to generate literature review tables.
A new dataset of 2,228 literature review tables, extracted from ArXiv papers, synthesizes a total of 7,542 research papers.
We evaluate LMs' abilities to reconstruct reference tables, finding this task benefits from additional context.
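A minimal sketch of the table-reconstruction task: given the abstracts of the papers a review table cites, plus the table caption as additional context, ask an LM to draft the table. The prompt format and llm() helper are assumptions, not the paper's setup.
```python
# Illustrative literature-review table drafting with an LM.
def draft_review_table(llm, caption: str, abstracts: list[str]) -> str:
    papers = "\n\n".join(f"[{i+1}] {a}" for i, a in enumerate(abstracts))
    prompt = (
        f"Table caption: {caption}\n\nPapers:\n{papers}\n\n"
        "Produce a Markdown table comparing these papers along the aspects "
        "implied by the caption, one row per paper."
    )
    return llm(prompt)
```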
arXiv Detail & Related papers (2024-10-25T18:31:50Z) - BERGEN: A Benchmarking Library for Retrieval-Augmented Generation [26.158785168036662]
Retrieval-Augmented Generation makes it possible to enhance Large Language Models with external knowledge.
Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline.
In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library that standardizes RAG experiments for reproducible research.
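Standardizing RAG experiments typically means describing each pipeline component declaratively so runs are comparable. The snippet below is a generic illustration of that idea only; it is not BERGEN's actual configuration schema or API, and all names are placeholders.
```python
# Generic, hypothetical RAG experiment description (not BERGEN's schema).
experiment = {
    "dataset":   "nq-open",
    "retriever": {"name": "bm25", "top_k": 50},
    "reranker":  {"name": "minilm-cross-encoder", "top_k": 5},
    "generator": {"name": "llama-2-7b-chat", "max_new_tokens": 128},
    "metrics":   ["match", "rouge-l", "llm-judge"],
}
```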
arXiv Detail & Related papers (2024-07-01T09:09:27Z) - Synthetic Test Collections for Retrieval Evaluation [31.36035082257619]
Test collections play a vital role in evaluation of information retrieval (IR) systems.
We investigate whether it is possible to use Large Language Models (LLMs) to construct synthetic test collections.
Our experiments indicate that LLMs can be used to construct synthetic test collections that are reliable for retrieval evaluation.
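One way an LLM can help construct such a collection is to generate a query that a seed document answers and then reuse an LLM judge for relevance labels. A hedged sketch with assumed prompts and a generic llm() callable, not the paper's exact procedure.
```python
# Illustrative synthetic query generation and relevance judging.
def synthetic_query(llm, document: str) -> str:
    return llm(
        "Write a realistic search query that the following document would "
        f"answer well. Return only the query.\n\n{document[:2000]}"
    ).strip()

def judge(llm, query: str, document: str) -> int:
    answer = llm(
        f"Query: {query}\nDocument: {document[:2000]}\n"
        "Is the document relevant to the query? Answer 0 or 1."
    )
    return int(answer.strip()[0])
```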
arXiv Detail & Related papers (2024-05-13T14:11:09Z) - STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STARK, a large-scale semi-structured retrieval benchmark on textual and relational knowledge bases.
Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine.
We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
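A sketch of the query-synthesis idea: combine relational constraints on a target entity with one of its textual properties, then have an LLM phrase the combination as a natural user query (here framed for product search, one of the benchmark's domains). Data structures and prompt are illustrative, not STARK's actual pipeline.
```python
# Illustrative synthesis of a semi-structured user query.
def synthesize_query(llm, entity: dict) -> str:
    relational = ", ".join(f"{r} {v}" for r, v in entity["relations"].items())
    prompt = (
        "Write a natural-sounding search query for a product that satisfies "
        f"these relations: {relational}, and whose description mentions: "
        f"\"{entity['text_property']}\". Do not name the product itself."
    )
    return llm(prompt).strip()
```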
arXiv Detail & Related papers (2024-04-19T22:54:54Z) - ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems [46.522527144802076]
We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems.
ARES finetunes lightweight LM judges to assess the quality of individual RAG components.
We make our code and datasets publicly available on GitHub.
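The judge idea is that a small fine-tuned classifier scores each RAG component (for instance, context relevance) instead of a large prompted LLM. The sketch below uses a generic Hugging Face text-classification pipeline with a placeholder model path and label names; it is not ARES's released code or models.
```python
# Illustrative lightweight LM judge for context relevance.
from transformers import pipeline

judge = pipeline("text-classification", model="path/to/finetuned-judge")  # placeholder model

def context_relevance(query: str, passage: str) -> float:
    result = judge(f"Query: {query} [SEP] Passage: {passage}")[0]
    # Assumed label names; map to a probability that the passage is relevant.
    return result["score"] if result["label"] == "RELEVANT" else 1 - result["score"]
```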
arXiv Detail & Related papers (2023-11-16T00:39:39Z) - Evaluating Generative Ad Hoc Information Retrieval [58.800799175084286]
Generative retrieval systems often directly return grounded generated text as a response to a query.
Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval.
arXiv Detail & Related papers (2023-11-08T14:05:00Z) - Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy [164.83371924650294]
We show that strong performance can be achieved by a method we call Iter-RetGen, which synergizes retrieval and generation in an iterative manner.
A model output shows what might be needed to finish a task, and thus provides an informative context for retrieving more relevant knowledge.
Iter-RetGen processes all retrieved knowledge as a whole and largely preserves the flexibility in generation without structural constraints.
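The loop described above is straightforward to sketch: each round retrieves with the question plus the previous model output, then regenerates the answer from the retrieved context. retrieve() and llm() are assumed callables; this is an illustration of the described method, not the authors' code.
```python
# Minimal sketch of iterative retrieval-generation synergy.
def iter_retgen(llm, retrieve, question: str, iterations: int = 3) -> str:
    output = ""
    for _ in range(iterations):
        # The previous output hints at what else needs to be retrieved.
        docs = retrieve(question + " " + output)
        context = "\n".join(docs)
        output = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return output
```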
arXiv Detail & Related papers (2023-05-24T16:17:36Z) - Active Retrieval Augmented Generation [123.68874416084499]
Augmenting large language models (LMs) by retrieving information from external knowledge resources is one promising solution.
Most existing retrieval augmented LMs employ a retrieve-and-generate setup that only retrieves information once based on the input.
We propose Forward-Looking Active REtrieval augmented generation (FLARE), a generic method which iteratively uses a prediction of the upcoming sentence to anticipate future content.
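A sketch of the forward-looking idea: draft the next sentence, and if the draft looks uncertain, use it as a retrieval query and redraft before committing. confidence(), retrieve(), and llm() are assumed interfaces and the threshold is arbitrary; this is not FLARE's implementation.
```python
# Illustrative forward-looking active retrieval loop.
def flare_generate(llm, retrieve, confidence, question: str, max_sents: int = 8) -> str:
    answer = ""
    for _ in range(max_sents):
        draft = llm(f"Question: {question}\nAnswer so far: {answer}\nNext sentence:")
        if not draft.strip():
            break
        if confidence(draft) < 0.8:        # an uncertain draft triggers retrieval
            docs = retrieve(draft)         # the draft itself becomes the query
            draft = llm(
                "Context:\n" + "\n".join(docs) +
                f"\nQuestion: {question}\nAnswer so far: {answer}\nNext sentence:"
            )
        answer = (answer + " " + draft.strip()).strip()
    return answer
```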
arXiv Detail & Related papers (2023-05-11T17:13:40Z) - Does Recommend-Revise Produce Reliable Annotations? An Analysis on Missing Instances in DocRED [60.39125850987604]
We show that the recommend-revise scheme results in false negative samples and an obvious bias towards popular entities and relations.
The relabeled dataset is released to serve as a more reliable test set for document-level relation extraction (RE) models.
arXiv Detail & Related papers (2022-04-17T11:29:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.