Related papers: Reproducible Hybrid Time-Travel Retrieval in Evolving Corpora

Reproducible Hybrid Time-Travel Retrieval in Evolving Corpora

URL: http://arxiv.org/abs/2411.04051v1
Date: Wed, 06 Nov 2024 16:57:55 GMT
Title: Reproducible Hybrid Time-Travel Retrieval in Evolving Corpora
Authors: Moritz Staudinger, Florina Piroi, Andreas Rauber,
Abstract summary: We present a hybrid retrieval system combining Lucene for fast retrieval with a column-store-based retrieval system maintaining a versioned and time-stamped index.
Score: 1.9202615342033464
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: There are settings in which reproducibility of ranked lists is desirable, such as when extracting a subset of an evolving document corpus for downstream research tasks or in domains such as patent retrieval or in medical systematic reviews, with high reproducibility expectations. However, as global term statistics change when documents change or are added to a corpus, queries using typical ranked retrieval models are not even reproducible for the parts of the document corpus that have not changed. Thus, Boolean retrieval frequently remains the mechanism of choice in such settings. We present a hybrid retrieval system combining Lucene for fast retrieval with a column-store-based retrieval system maintaining a versioned and time-stamped index. The latter component allows re-execution of previously posed queries resulting in the same ranked list and further allows for time-travel queries over evolving collection, as web archives, while maintaining the original ranking. Thus, retrieval results in evolving document collections are fully reproducible even when document collections and thus term statistics change.

Related papers

PRISM: Fine-Grained Paper-to-Paper Retrieval with Multi-Aspect-Aware Query Optimization [61.783280234747394]
PRISM is a document-to-document retrieval method that introduces multiple, fine-grained representations for both the query and candidate papers.<n>We present SciFullBench, a novel benchmark in which the complete and segmented context of full papers for both queries and candidates is available.<n>Experiments show that PRISM improves performance by an average of 4.3% over existing retrieval baselines.
arXiv Detail & Related papers (2025-07-14T08:41:53Z)
MURR: Model Updating with Regularized Replay for Searching a Document Stream [32.0637790321157]
Internet produces a continuous stream of new documents and user-generated queries. Neural retrieval models that were trained once on a fixed set of query-document pairs will quickly start misrepresenting newly-created content. We propose MURR, a model updating strategy with regularized replay, to ensure the model can still faithfully search existing documents without reprocessing.
arXiv Detail & Related papers (2025-04-14T14:13:03Z)
Cognitive-Aligned Document Selection for Retrieval-augmented Generation [2.9060210098040855]
We propose GGatrieval to dynamically update queries and filter high-quality, reliable retrieval documents. We parse the user query into its syntactic components and perform fine-grained grounded alignment with the retrieved documents. Our approach introduces a novel criterion for filtering retrieved documents, closely emulating human strategies for acquiring targeted information.
arXiv Detail & Related papers (2025-02-17T13:00:15Z)
Counterfactual Query Rewriting to Use Historical Relevance Feedback [25.893083499927776]
We propose approaches to rewrite user queries and compare them against a system that directly uses the previous qrels for the ranking. We expand queries with terms extracted from the previously relevant documents or derive so-called keyqueries that rank the previously relevant documents to the top of the current corpus. Our evaluation in the CLEF LongEval scenario shows that rewriting queries with historical relevance feedback improves the retrieval effectiveness and even outperforms computationally expensive transformer-based approaches.
arXiv Detail & Related papers (2025-02-06T09:05:41Z)
Ranking Narrative Query Graphs for Biomedical Document Retrieval (Technical Report) [7.527096697768715]
This paper extends our existing graph-based discovery system for the biomedical domain. It contributes effective graph-based unsupervised ranking methods, a new query relaxation paradigm, and ontological rewriting.
arXiv Detail & Related papers (2024-12-06T12:49:28Z)
Quam: Adaptive Retrieval through Query Affinity Modelling [15.3583908068962]
Building relevance models to rank documents based on user information needs is a central task in information retrieval and the NLP community. We propose a unifying view of the nascent area of adaptive retrieval by proposing, Quam. Our proposed approach, Quam improves the recall performance by up to 26% over the standard re-ranking baselines.
arXiv Detail & Related papers (2024-10-26T22:52:12Z)
Open-World Evaluation for Retrieving Diverse Perspectives [39.22331280176582]
We curate a Benchmark for Retrieval Diversity for Subjective questions (BERDS) Each example consists of a question and diverse perspectives associated with the question. We build a language model based automatic evaluator that decides whether each retrieved document contains a perspective.
arXiv Detail & Related papers (2024-09-26T17:52:57Z)
Database-Augmented Query Representation for Information Retrieval [59.57065228857247]
We present a novel retrieval framework called Database-Augmented Query representation (DAQu) DAQu augments the original query with various (query-related) metadata across multiple tables. We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database.
arXiv Detail & Related papers (2024-06-23T05:02:21Z)
ExcluIR: Exclusionary Neural Information Retrieval [74.08276741093317]
We present ExcluIR, a set of resources for exclusionary retrieval. evaluation benchmark includes 3,452 high-quality exclusionary queries. training set contains 70,293 exclusionary queries, each paired with a positive document and a negative document.
arXiv Detail & Related papers (2024-04-26T09:43:40Z)
Generative Retrieval as Multi-Vector Dense Retrieval [71.75503049199897]
Generative retrieval generates identifiers of relevant documents in an end-to-end manner. Prior work has demonstrated that generative retrieval with atomic identifiers is equivalent to single-vector dense retrieval. We show that generative retrieval and multi-vector dense retrieval share the same framework for measuring the relevance to a query of a document.
arXiv Detail & Related papers (2024-03-31T13:29:43Z)
Corrective Retrieval Augmented Generation [36.04062963574603]
Retrieval-augmented generation (RAG) relies heavily on relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong. We propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation. CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches.
arXiv Detail & Related papers (2024-01-29T04:36:39Z)
Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
Often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We introduce a novel retrieval unit, proposition, for dense retrieval. Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
Evaluating Generative Ad Hoc Information Retrieval [58.800799175084286]
generative retrieval systems often directly return a grounded generated text as a response to a query. Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval.
arXiv Detail & Related papers (2023-11-08T14:05:00Z)
Archive TimeLine Summarization (ATLS): Conceptual Framework for Timeline Generation over Historical Document Collections [17.332692582748408]
We propose to extend TimeLine Summarization (TLS) methods on archive collections to assist in their studies. We describe a conceptual framework for an Archive TimeLine Summarization (ATLS) system, which aims to generate informative, readable and interpretable timelines.
arXiv Detail & Related papers (2023-01-31T08:58:47Z)
GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion. The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
CODER: An efficient framework for improving retrieval through COntextualized Document Embedding Reranking [11.635294568328625]
We present a framework for improving the performance of a wide class of retrieval models at minimal computational cost. It utilizes precomputed document representations extracted by a base dense retrieval method. It incurs a negligible computational overhead on top of any first-stage method at run time, allowing it to be easily combined with any state-of-the-art dense retrieval method.
arXiv Detail & Related papers (2021-12-16T10:25:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.