FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents
- URL: http://arxiv.org/abs/2504.13128v1
- Date: Thu, 17 Apr 2025 17:44:06 GMT
- Title: FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents
- Authors: Nandan Thakur, Jimmy Lin, Sam Havens, Michael Carbin, Omar Khattab, Andrew Drozdov
- Abstract summary: We introduce FreshStack, a reusable framework for building information retrieval benchmarks from community-asked questions and answers. FreshStack conducts the following steps: automatic corpus collection from code and technical documentation, nugget generation from community-asked questions and answers, and nugget-level support. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging.
- Score: 53.5649975411777
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not clearly improve first-stage retrieval accuracy (two out of five topics). We hope that FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks. FreshStack datasets are available at: https://fresh-stack.github.io.
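Step (3)'s "fusion of retrieval techniques" can be illustrated with reciprocal rank fusion (RRF), a standard way to merge ranked lists from several retrievers (e.g., BM25 plus a dense model). The sketch below shows the general technique with the customary k=60 constant; it is illustrative only, not necessarily the exact fusion FreshStack uses.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    `rankings` holds ranked lists (best first); `k` dampens the
    influence of top ranks (k=60 follows Cormack et al., 2009).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: fuse a lexical (BM25) ranking with a dense ranking.
bm25_hits = ["doc3", "doc1", "doc7"]
dense_hits = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# ['doc1', 'doc3', 'doc9', 'doc7']: documents ranked high by both lists win
```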
Related papers
- Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation [15.31883349259767]
Rankify is an open-source toolkit designed to unify retrieval, re-ranking, and RAG within a cohesive framework.
It supports a wide range of retrieval techniques, including dense and sparse retrievers, while incorporating state-of-the-art re-ranking models.
Rankify includes a collection of pre-retrieved datasets to facilitate benchmarking, available on Hugging Face.
arXiv Detail & Related papers (2025-02-04T16:33:25Z)
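Rankify's organizing abstraction, first-stage retrieval followed by re-ranking, follows a familiar two-stage pattern. The sketch below shows that generic pattern with hypothetical injected callables; it is not Rankify's actual API.

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    retriever: Callable[[str, int], List[str]],   # returns candidate doc IDs
    reranker: Callable[[str, str], float],        # scores a (query, doc) pair
    first_stage_k: int = 100,
    final_k: int = 10,
) -> List[Tuple[str, float]]:
    """Generic two-stage pipeline: a cheap first-stage retriever produces
    candidates, then a more expensive reranker (e.g., a cross-encoder)
    reorders them. Both components are injected, so sparse or dense
    retrievers and any re-ranking model can be swapped in."""
    candidates = retriever(query, first_stage_k)
    scored = [(doc, reranker(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:final_k]
```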
- Hierarchical Retrieval-Augmented Generation Model with Rethink for Multi-hop Question Answering [24.71247954169364]
Multi-hop Question Answering (QA) requires complex reasoning that integrates multiple pieces of information to resolve intricate questions.
Existing QA systems encounter challenges such as outdated information, context window length limitations, and an accuracy-quantity trade-off.
We propose a novel framework, the Hierarchical Retrieval-Augmented Generation Model with Rethink (HiRAG), comprising five key modules: Decomposer, Definer, Retriever, Filter, and Summarizer.
arXiv Detail & Related papers (2024-08-20T09:29:31Z)
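The five HiRAG modules named above suggest a sequential decompose-retrieve-filter-summarize flow. The skeleton below is a schematic reading of that module list; all interfaces are assumed for illustration rather than taken from the paper.

```python
from typing import Callable, List

def hirag_answer(
    question: str,
    decompose: Callable[[str], List[str]],        # Decomposer: split into sub-questions
    define: Callable[[str], str],                 # Definer: make each sub-question precise
    retrieve: Callable[[str], List[str]],         # Retriever: fetch candidate passages
    keep: Callable[[str, str], bool],             # Filter: drop irrelevant passages
    summarize: Callable[[str, List[str]], str],   # Summarizer: compose the final answer
) -> str:
    """Schematic multi-hop QA flow through the five modules; a full system
    would also implement the paper's 'rethink' loop to revisit earlier hops."""
    evidence: List[str] = []
    for sub_q in decompose(question):
        refined = define(sub_q)
        passages = [p for p in retrieve(refined) if keep(refined, p)]
        evidence.extend(passages)
    return summarize(question, evidence)
```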
- BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval [54.54576644403115]
We introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents.
Our dataset consists of 1,384 real-world queries spanning diverse domains, such as economics, psychology, mathematics, and coding.
We show that incorporating explicit reasoning about the query improves retrieval performance by up to 12.2 points.
arXiv Detail & Related papers (2024-07-16T17:58:27Z)
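BRIGHT's finding that explicit reasoning about the query helps retrieval is commonly operationalized by letting a model expand the query with a reasoning trace before searching. A minimal sketch, assuming hypothetical `generate` and `retrieve` callables:

```python
from typing import Callable, List

def reason_then_retrieve(
    query: str,
    generate: Callable[[str], str],      # hypothetical LLM call
    retrieve: Callable[[str], List[str]],  # any retriever over the corpus
) -> List[str]:
    """Expand the query with a model-written reasoning trace, then retrieve
    with the expanded query. The trace surfaces terminology the bare query
    lacks, which is what makes reasoning-intensive queries easier to match."""
    trace = generate(
        f"Think step by step about what knowledge is needed to answer:\n{query}"
    )
    return retrieve(f"{query}\n{trace}")
```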
- ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models [12.035509884945789]
We introduce a tuning-free framework called ReFeR, designed to evaluate generative outputs, including both text and images.
We rigorously evaluate our framework, ReFeR, across four diverse evaluation tasks.
Experiments on four reasoning tasks demonstrate superior collective reasoning abilities of the framework.
arXiv Detail & Related papers (2024-07-16T08:25:26Z)
- Benchmarking Predictive Coding Networks -- Made Simple [48.652114040426625]
We tackle the problems of efficiency and scalability for predictive coding networks (PCNs) in machine learning.
We propose a library, called PCX, that focuses on performance and simplicity, and use it to implement a large set of standard benchmarks.
We perform extensive tests on such benchmarks using both existing algorithms for PCNs and adaptations of other methods popular in the bio-plausible deep learning community.
arXiv Detail & Related papers (2024-07-01T10:33:44Z)
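Predictive coding networks infer latent activities by gradient descent on a stack of layer-wise prediction errors, and the cost of that inner inference loop is what makes efficiency a real concern. Below is a toy one-latent-layer relaxation in NumPy, assuming linear activations and a zero-mean prior on the latent; it illustrates the general mechanism, not PCX's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 4))   # top-down weights: latent -> observation
x_obs = rng.normal(size=8)               # clamped observation
latent = np.zeros(4)                     # latent activity to be inferred

def energy(latent):
    err_obs = x_obs - W @ latent          # bottom-layer prediction error
    return 0.5 * err_obs @ err_obs + 0.5 * latent @ latent  # + zero-mean prior term

lr = 0.1
for step in range(100):                   # inference = gradient descent on the energy
    err_obs = x_obs - W @ latent
    grad = -W.T @ err_obs + latent        # dE/d(latent)
    latent -= lr * grad

print(f"final energy: {energy(latent):.4f}")
```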
- Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track [51.25144287084172]
It is crucial to have an arena to build, test, visualize, and systematically evaluate RAG-based search systems.
We propose the TREC 2024 RAG Track to foster innovation in evaluating RAG systems.
arXiv Detail & Related papers (2024-06-24T17:37:52Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
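Retrieval-augmented code generation of the kind CodeRAG-Bench evaluates amounts to prepending retrieved context to the generation prompt. A minimal sketch, with `retrieve` and `generate` as hypothetical callables rather than anything from the benchmark's codebase:

```python
from typing import Callable, List

def rag_generate_code(
    task: str,
    retrieve: Callable[[str, int], List[str]],  # e.g., over docs, tutorials, repos
    generate: Callable[[str], str],             # hypothetical code-LLM call
    k: int = 3,
) -> str:
    """Fetch k context snippets for the task and condition generation on them.
    CodeRAG-Bench varies the retrieval source (one or several corpora) and
    measures how much the retrieved context actually helps."""
    snippets = retrieve(task, k)
    context = "\n\n".join(f"# Context {i + 1}:\n{s}" for i, s in enumerate(snippets))
    return generate(f"{context}\n\n# Task:\n{task}\n# Solution:\n")
```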
- Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions [50.114651561111245]
We propose IRCoT, a new approach for multi-step question answering.
It interleaves retrieval with steps in a chain of thought (CoT), using the CoT to guide retrieval and, in turn, the retrieved results to improve the CoT.
arXiv Detail & Related papers (2022-12-20T18:26:34Z)
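The interleaving IRCoT describes can be sketched as a loop that alternates one chain-of-thought step with one retrieval step, each feeding the other. `generate_step` and `retrieve` below are assumed callables and the stop heuristic is simplified; this is a sketch of the idea, not the authors' code.

```python
from typing import Callable, List

def ircot(
    question: str,
    generate_step: Callable[[str], str],   # next CoT sentence given a prompt
    retrieve: Callable[[str], List[str]],  # retrieve passages for a query
    max_steps: int = 8,
) -> str:
    """Alternate reasoning and retrieval: each new CoT sentence becomes the
    next retrieval query, and the retrieved passages condition the next
    CoT sentence."""
    passages: List[str] = list(retrieve(question))
    thoughts: List[str] = []
    for _ in range(max_steps):
        prompt = "\n".join(passages + [f"Q: {question}"] + thoughts)
        thought = generate_step(prompt)
        thoughts.append(thought)
        if "answer is" in thought.lower():  # simple stop heuristic
            break
        passages.extend(retrieve(thought))  # the CoT guides the next retrieval
    return " ".join(thoughts)
```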
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
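Using "all ngrams in a passage as its possible identifiers" means any substring the model generates can be mapped back to the passages that contain it (the paper does this efficiently with an FM-index). The toy inverted-ngram index below shows just that mapping, without the autoregressive scoring; it is a simplification, not the paper's method.

```python
from collections import defaultdict

def build_ngram_index(passages, max_n=3):
    """Map every word ngram (up to max_n words) to the passages containing
    it, so any generated ngram acts as a passage identifier."""
    index = defaultdict(set)
    for pid, text in passages.items():
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])].add(pid)
    return index

passages = {
    "p1": "nugget generation from community questions",
    "p2": "dense retrieval with hybrid architectures",
}
index = build_ngram_index(passages)
print(index["hybrid architectures"])  # {'p2'}: the ngram identifies its passage
```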
- Answering Open-Domain Questions of Varying Reasoning Steps from Text [39.48011017748654]
We develop a unified system to answer open-domain questions directly from text.
We employ a single multi-task transformer model to perform all the necessary subtasks.
We show that our model demonstrates competitive performance on both existing benchmarks and the new benchmark introduced in the paper.
arXiv Detail & Related papers (2020-10-23T16:51:09Z)