WindTunnel -- A Framework for Community Aware Sampling of Large Corpora
- URL: http://arxiv.org/abs/2410.20301v1
- Date: Sun, 27 Oct 2024 00:49:52 GMT
- Title: WindTunnel -- A Framework for Community Aware Sampling of Large Corpora
- Authors: Michael Iannelli
- Abstract summary: WindTunnel is a framework developed at Yext to generate representative samples of large corpora.
WindTunnel overcomes limitations in current sampling methods, providing more accurate evaluations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conducting comprehensive information retrieval experiments, such as in search or retrieval augmented generation, often comes with high computational costs. This is because evaluating a retrieval algorithm requires indexing the entire corpus, which is significantly larger than the set of (query, result) pairs under evaluation. This issue is especially pronounced in big data and neural retrieval, where indexing becomes increasingly time-consuming and complex. In this paper, we present WindTunnel, a novel framework developed at Yext to generate representative samples of large corpora, enabling efficient end-to-end information retrieval experiments. By preserving the community structure of the dataset, WindTunnel overcomes limitations in current sampling methods, providing more accurate evaluations.
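The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one way a sample could preserve community structure: stratified sampling over community labels. It assumes each document already carries a community label from some earlier community-detection step, which is not implemented here. The function name `community_stratified_sample` and its parameters are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def community_stratified_sample(doc_communities, fraction, seed=0):
    """Sample roughly `fraction` of the documents from every community.

    doc_communities maps a document id to a community label. The labels are
    assumed to come from a separate community-detection step (hypothetical
    here); ids and labels must be sortable, e.g. all strings.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)

    # Group document ids by their community label.
    by_community = defaultdict(list)
    for doc_id, community in doc_communities.items():
        by_community[community].append(doc_id)

    sample = []
    for community in sorted(by_community):
        members = sorted(by_community[community])
        # Keep at least one document so small communities are not dropped.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy corpus: 80 documents in community "a", 20 in community "b".
docs = {f"d{i}": ("a" if i < 80 else "b") for i in range(100)}
sample = community_stratified_sample(docs, fraction=0.1)
```

Sampling within each community keeps the relative community sizes of the toy corpus, whereas uniform sampling over the whole corpus can miss small communities entirely.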
Related papers
- Forward Index Compression for Learned Sparse Retrieval [15.629655228398567]
We focus on the size of a data structure that is common to all algorithmic flavors and that constitutes a substantial fraction of the overall index size: the forward index.
In particular, we seek compression techniques to reduce the storage footprint of the forward index without compromising search quality or inner product computation latency.
arXiv Detail & Related papers (2026-02-05T08:35:17Z)
- Query Decomposition for RAG: Balancing Exploration-Exploitation [83.79639293409802]
RAG systems address complex user requests by decomposing them into subqueries, retrieving potentially relevant documents for each, and then aggregating them to generate an answer.
We formulate query decomposition and document retrieval in an exploitation-exploration setting, where retrieving one document at a time builds a belief about the utility of a given sub-query.
Our main finding is that estimating document relevance using rank information and human judgments yields a 35% gain in document-level precision, a 15% increase in alpha-nDCG, and better performance on the downstream task of long-form generation.
arXiv Detail & Related papers (2025-10-21T13:37:11Z)
- Test-time Corpus Feedback: From Retrieval to RAG [21.517949407443453]
Retrieval-Augmented Generation (RAG) has emerged as a standard framework for knowledge-intensive NLP tasks.
Most RAG pipelines treat retrieval and reasoning as isolated components, retrieving documents once and then generating answers without further interaction.
Recent work in both the information retrieval (IR) and NLP communities has begun to close this gap by introducing adaptive retrieval and ranking methods that incorporate feedback.
arXiv Detail & Related papers (2025-08-21T10:57:38Z)
- Deep Researcher with Test-Time Diffusion [32.375428487905104]
Test-Time Diffusion Deep Researcher conceptualizes research report generation as a diffusion process.
Draft-centric design makes the report writing process more timely and coherent.
We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks.
arXiv Detail & Related papers (2025-07-21T21:23:21Z)
- A Query-Driven Approach to Space-Efficient Range Searching [12.760453906939446]
We show that a near-linear sample of queries allows the construction of a partition tree with a near-optimal expected number of nodes visited during querying.
We enhance this approach by treating node processing as a classification problem, leveraging fast classifiers like shallow neural networks to obtain experimentally efficient query times.
Our algorithm, based on a sample of queries, builds a balanced tree with nodes associated with separators that minimize query stabs on expectation.
arXiv Detail & Related papers (2025-02-19T12:01:00Z)
- Generating Realistic Synthetic Head Rotation Data for Extended Reality using Deep Learning [12.131070527836005]
We present a head rotation time series generator based on TimeGAN, an extension of the well-known Generative Adversarial Network.
This approach is able to extend a dataset of head rotations with new samples closely matching the distribution of the measured time series.
arXiv Detail & Related papers (2025-01-15T12:14:15Z)
- Unsupervised Query Routing for Retrieval Augmented Generation [64.47987041500966]
We introduce a novel unsupervised method that constructs the "upper-bound" response to evaluate the quality of retrieval-augmented responses.
This evaluation enables the decision of the most suitable search engine for a given query.
By eliminating manual annotations, our approach can automatically process large-scale real user queries and create training data.
arXiv Detail & Related papers (2025-01-14T02:27:06Z)
- ConTReGen: Context-driven Tree-structured Retrieval for Open-domain Long-form Text Generation [26.4086456393314]
Long-form text generation requires coherent, comprehensive responses that address complex queries with both breadth and depth.
Existing iterative retrieval-augmented generation approaches often struggle to delve deeply into each facet of complex queries.
This paper introduces ConTReGen, a novel framework that employs a context-driven, tree-structured retrieval approach.
arXiv Detail & Related papers (2024-10-20T21:17:05Z)
- Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes? [62.57689536630933]
We provide experimental results on the BEIR dataset using the open-source Lucene search library.
Our results provide guidance for today's search practitioner in understanding the design space of dense and sparse retrievers.
arXiv Detail & Related papers (2024-09-10T12:46:23Z)
- Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations [8.796275989527054]
We propose a novel organization of the inverted index that enables fast retrieval over learned sparse embeddings.
Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector.
Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions.
arXiv Detail & Related papers (2024-04-29T15:49:27Z)
- Concurrent Brainstorming & Hypothesis Satisfying: An Iterative Framework for Enhanced Retrieval-Augmented Generation (R2CBR3H-SR) [0.456877715768796]
This study introduces an innovative, iterative retrieval-augmented generation system.
Our approach uniquely integrates a vector-space driven re-ranking mechanism with concurrent brainstorming to expedite the retrieval of highly relevant documents.
This research advances the state-of-the-art in intelligent retrieval systems, setting a new benchmark for resource-efficient information extraction and abstraction in knowledge-intensive applications.
arXiv Detail & Related papers (2024-01-03T17:01:44Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to millions of passages, using a corpus of 8.8M passages and evaluating model sizes up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z)
- Improving Out-of-Distribution Generalization of Neural Rerankers with Contextualized Late Interaction [52.63663547523033]
Late interaction, the simplest form of multi-vector, is also helpful to neural rerankers that only use the [CLS] vector to compute the similarity score.
We show that the finding is consistent across different model sizes and first-stage retrievers of diverse natures.
arXiv Detail & Related papers (2023-02-13T18:42:17Z)
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks [62.22920673080208]
A single-step generative model can dramatically simplify the search process and be optimized in an end-to-end manner.
We name the pre-trained generative retrieval model CorpusBrain because all information about the corpus is encoded in its parameters, without the need to construct an additional index.
arXiv Detail & Related papers (2022-08-16T10:22:49Z)
- Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering [87.32442219333046]
We propose a simple and resource-efficient method to pretrain the paragraph encoder.
Our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
arXiv Detail & Related papers (2020-04-30T18:09:50Z)