Related papers: Evaluation of Temporal Change in IR Test Collections

URL: http://arxiv.org/abs/2407.01373v1
Date: Mon, 01 Jul 2024 15:25:31 GMT
Title: Evaluation of Temporal Change in IR Test Collections
Authors: Jüri Keller, Timo Breuer, Philipp Schaer
Abstract summary: This work investigates how the temporal generalizability of effectiveness evaluations can be assessed. We show that the proposed measures can be well adapted to describe the changes in the retrieval results.
Score: 3.4917392789760147
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Information retrieval systems have been evaluated using the Cranfield paradigm for many years. This paradigm allows a systematic, fair, and reproducible evaluation of different retrieval methods in fixed experimental environments. However, real-world retrieval systems must cope with dynamic environments and temporal changes that affect the document collection, topical trends, and the individual user's perception of what is considered relevant. Yet, the temporal dimension in IR evaluations is still understudied. To this end, this work investigates how the temporal generalizability of effectiveness evaluations can be assessed. As a conceptual model, we generalize Cranfield-type experiments to the temporal context by classifying the change in the essential components according to the create, update, and delete operations of persistent storage known from CRUD. From the different types of change different evaluation scenarios are derived and it is outlined what they imply. Based on these scenarios, renowned state-of-the-art retrieval systems are tested and it is investigated how the retrieval effectiveness changes on different levels of granularity. We show that the proposed measures can be well adapted to describe the changes in the retrieval results. The experiments conducted confirm that the retrieval effectiveness strongly depends on the evaluation scenario investigated. We find that not only the average retrieval performance of single systems but also the relative system performance are strongly affected by the components that change and to what extent these components changed.
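
The abstract describes classifying change in a test collection's components by the create, update, and delete operations known from CRUD. The following is a minimal sketch of that idea applied to two snapshots of a document collection. It is not the authors' implementation: the function name `classify_changes`, the `ChangeSet` type, and the toy snapshots are hypothetical, and documents are assumed to be identified by stable string IDs.

```python
from typing import Dict, NamedTuple, Set


class ChangeSet(NamedTuple):
    """Document IDs grouped by the kind of change between two snapshots."""
    created: Set[str]
    updated: Set[str]
    deleted: Set[str]


def classify_changes(old: Dict[str, str], new: Dict[str, str]) -> ChangeSet:
    """Compare two snapshots (doc ID -> content) and label each change.

    Hypothetical helper for illustration only; assumes IDs are stable
    across snapshots.
    """
    old_ids, new_ids = set(old), set(new)
    created = new_ids - old_ids                      # only in the later snapshot
    deleted = old_ids - new_ids                      # only in the earlier snapshot
    updated = {doc_id for doc_id in old_ids & new_ids
               if old[doc_id] != new[doc_id]}        # present in both, content differs
    return ChangeSet(created, updated, deleted)


# Toy snapshots (made-up data, not from the paper).
snapshot_t0 = {"d1": "apple", "d2": "banana", "d3": "cherry"}
snapshot_t1 = {"d1": "apple", "d2": "banana split", "d4": "date"}

changes = classify_changes(snapshot_t0, snapshot_t1)
print(changes)
```

The same comparison could in principle be applied to the other components the abstract mentions, such as topics or relevance judgments, by swapping what the dictionary keys and values represent.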

Related papers

Investigating the Robustness of Retrieval-Augmented Generation at the Query Level [4.3028340012580975]
Retrieval-augmented generation (RAG) has been proposed as a solution that dynamically incorporates external knowledge during inference. Despite its promise, RAG systems face practical challenges, most notably a strong dependence on the quality of the input query for accurate retrieval.
arXiv Detail & Related papers (2025-07-09T15:39:17Z)
Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation [5.136283512042341]
"Class-dependent evaluation effects" raise questions about whether perturbation analysis reliably measures attribution quality. We compare perturbation-based degradation scores with ground truth-based precision-recall metrics using multiple attribution methods. Most critically, we find that perturbation-based and ground truth metrics frequently yield contradictory assessments of attribution quality across classes.
arXiv Detail & Related papers (2025-06-13T13:52:32Z)
Collaborative Value Function Estimation Under Model Mismatch: A Federated Temporal Difference Analysis [55.13545823385091]
Federated reinforcement learning (FedRL) enables collaborative learning while preserving data privacy by preventing direct data exchange between agents. In real-world applications, each agent may experience slightly different transition dynamics, leading to inherent model mismatches. We show that even moderate levels of information sharing can significantly mitigate environment-specific errors.
arXiv Detail & Related papers (2025-03-21T18:06:28Z)
Variations in Relevance Judgments and the Shelf Life of Test Collections [50.060833338921945]
The paradigm shift towards neural retrieval models affected the characteristics of modern test collections. We reproduce prior work in the neural retrieval setting, showing that assessor disagreement does not affect system rankings. We observe that some models substantially degrade with our new relevance judgments, and some have already reached the effectiveness of humans as rankers.
arXiv Detail & Related papers (2025-02-28T10:46:56Z)
CoFE-RAG: A Comprehensive Full-chain Evaluation Framework for Retrieval-Augmented Generation with Enhanced Data Diversity [23.48167670445722]
Retrieval-Augmented Generation (RAG) aims to generate more accurate and reliable answers with the help of the retrieved context from external knowledge sources. Evaluating these systems remains a crucial research area due to the following issues. We propose a Comprehensive Full-chain Evaluation (CoFE-RAG) framework to facilitate thorough evaluation across the entire RAG pipeline.
arXiv Detail & Related papers (2024-10-16T05:20:32Z)
Replicability Measures for Longitudinal Information Retrieval Evaluation [3.4917392789760147]
This work explores how the effectiveness measured in evolving experiments can be assessed. The persistency of effectiveness is investigated as a replicability task. It was found that the most effective systems are not necessarily the ones with the most persistent performance.
arXiv Detail & Related papers (2024-09-09T08:19:43Z)
Meta-Learners for Partially-Identified Treatment Effects Across Multiple Environments [67.80453452949303]
Estimating the conditional average treatment effect (CATE) from observational data is relevant for many applications such as personalized medicine. Here, we focus on the widespread setting where the observational data come from multiple environments. We propose different model-agnostic learners (so-called meta-learners) to estimate the bounds that can be used in combination with arbitrary machine learning models.
arXiv Detail & Related papers (2024-06-04T16:31:43Z)
Process Variant Analysis Across Continuous Features: A Novel Framework [0.0]
This research addresses the challenge of effectively segmenting cases within operational processes. We present a novel approach employing a sliding window technique combined with the earth mover's distance to detect changes in control flow behavior. We validate our methodology through a real-life case study in collaboration with UWV, the Dutch employee insurance agency.
arXiv Detail & Related papers (2024-05-06T16:10:13Z)
Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages. Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
Decoy Effect In Search Interaction: Understanding User Behavior and Measuring System Vulnerability [33.78769577114657]
The study explores how decoy results alter users' interactions on search engine result pages. It introduces the DEJA-VU metric to assess systems' susceptibility to the decoy effect. The results show differences in systems' effectiveness and vulnerability.
arXiv Detail & Related papers (2024-03-27T11:20:48Z)
Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem. We examine the performance of various debiasing methods across multiple tasks. We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
Evaluating generative audio systems and their metrics [80.97828572629093]
This paper investigates state-of-the-art approaches side-by-side with (i) a set of previously proposed objective metrics for audio reconstruction, and (ii) a listening study. Results indicate that currently used objective metrics are insufficient to describe the perceptual quality of current systems.
arXiv Detail & Related papers (2022-08-31T21:48:34Z)
Stateful Offline Contextual Policy Evaluation and Learning [88.9134799076718]
We study off-policy evaluation and learning from sequential data. We formalize the relevant causal structure of problems such as dynamic personalized pricing. We show improved out-of-sample policy performance in this class of relevant problems.
arXiv Detail & Related papers (2021-10-19T16:15:56Z)
Fairness and underspecification in acoustic scene classification: The case for disaggregated evaluations [6.186191586944725]
Underspecification and fairness in machine learning (ML) applications have recently become two prominent issues in the ML community. We argue for the need for a more holistic evaluation process for acoustic scene classification (ASC) models through disaggregated evaluations. We demonstrate the effectiveness of the proposed evaluation process in uncovering underspecification and fairness problems when trained on two widely-used ASC datasets.
arXiv Detail & Related papers (2021-10-04T15:23:01Z)
Generalization Bounds and Representation Learning for Estimation of Potential Outcomes and Causal Effects [61.03579766573421]
We study estimation of individual-level causal effects, such as a single patient's response to alternative medication. We devise representation learning algorithms that minimize our bound, by regularizing the representation's induced treatment group distance. We extend these algorithms to simultaneously learn a weighted representation to further reduce treatment group distances.
arXiv Detail & Related papers (2020-01-21T10:16:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.