Combining Query Performance Predictors: A Reproducibility Study
- URL: http://arxiv.org/abs/2503.24251v1
- Date: Mon, 31 Mar 2025 16:01:58 GMT
- Title: Combining Query Performance Predictors: A Reproducibility Study
- Authors: Sourav Saha, Suchana Datta, Dwaipayan Roy, Mandar Mitra, Derek Greene
- Abstract summary: As early as 2009, Hauff et al. [28] explored whether different QPP methods may be combined to improve prediction quality. This study revisits Hauff et al.'s work to assess the extent of their findings in the light of new prediction methods, evaluation metrics, and datasets.
- Score: 6.681467202699048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A large number of approaches to Query Performance Prediction (QPP) have been proposed over the last two decades. As early as 2009, Hauff et al. [28] explored whether different QPP methods may be combined to improve prediction quality. Since then, significant research has been done on both QPP approaches and their evaluation. This study revisits Hauff et al.'s work to assess the reproducibility of their findings in the light of new prediction methods, evaluation metrics, and datasets. We expand the scope of the earlier investigation by: (i) considering post-retrieval methods, including supervised neural techniques (only pre-retrieval techniques were studied in [28]); (ii) using sMARE for evaluation, in addition to the traditional correlation coefficients and RMSE; and (iii) experimenting with additional datasets (ClueWeb09B and TREC DL). Our results largely support previous claims, but we also present several interesting findings. We interpret these findings by taking a more nuanced look at the correlation between QPP methods, examining whether they capture diverse information or rely on overlapping factors.
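The abstract names two families of QPP evaluation measures: rank correlation coefficients and sMARE. The following is a minimal sketch of both, not the authors' evaluation code. It assumes the commonly used definition of sMARE as the mean absolute difference between a query's rank under the predictor and its rank under the true effectiveness, scaled by the number of queries, and it uses Kendall's tau-a as one example of a correlation coefficient. The function names and the toy scores are hypothetical.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks (1 = highest value); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = average_rank
        pos = end + 1
    return result


def smare(predicted, actual):
    """Scaled mean absolute rank error (assumed definition): lower is better, 0 is perfect."""
    n = len(predicted)
    rank_pred, rank_true = ranks(predicted), ranks(actual)
    return sum(abs(p - t) for p, t in zip(rank_pred, rank_true)) / (n * n)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of query pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) / 2)


# Hypothetical predictor scores and true per-query effectiveness (e.g. AP) for four queries.
predictor_scores = [0.9, 0.4, 0.7, 0.1]
true_effectiveness = [0.5, 0.3, 0.6, 0.2]

print("sMARE:", smare(predictor_scores, true_effectiveness))
print("Kendall tau:", kendall_tau(predictor_scores, true_effectiveness))
```

Because sMARE is an average of per-query errors, it allows per-query analysis, whereas a correlation coefficient yields a single number for the whole query set.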
Related papers
- Beyond Correlations: A Downstream Evaluation Framework for Query Performance Prediction [10.378957672522157]
The standard practice of query performance prediction (QPP) evaluation is to measure a set-level correlation between the estimated retrieval qualities and the true ones.
We propose a downstream-focussed evaluation framework where a distribution of QPP estimates across a list of top- Documents retrieved with several rankers is used as priors for IR fusion.
While on the one hand, a distribution of these estimates closely matching that of the true retrieval qualities indicates the quality of the predictor, their usage as priors on the other hand indicates a predictor's ability to make informed choices in an IR pipeline.
arXiv Detail & Related papers (2026-01-24T06:58:30Z) - Constructing Confidence Intervals for Average Treatment Effects from Multiple Datasets [51.2467404472005]
We propose a new method that estimates the ATE from multiple observational datasets and provides valid CIs.
Our method makes few assumptions about the observational datasets and is thus widely applicable in medical practice.
arXiv Detail & Related papers (2024-12-16T07:39:46Z) - Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a QPP framework using automatically generated relevance judgments (QPP-GenRE).
QPP-GenRE decomposes QPP into independent subtasks of predicting relevance of each item in a ranked list to a given query.
This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels.
arXiv Detail & Related papers (2024-04-01T09:33:05Z) - Query Performance Prediction: From Ad-hoc to Conversational Search [55.37199498369387]
Query performance prediction (QPP) is a core task in information retrieval.
Research has shown the effectiveness and usefulness of QPP for ad-hoc search.
Despite its potential, QPP for conversational search has been little studied.
arXiv Detail & Related papers (2023-05-18T12:37:01Z) - Forecast reconciliation for vaccine supply chain optimization [61.13962963550403]
Vaccine supply chain optimization can benefit from hierarchical time series forecasting.
Forecasts of different hierarchy levels become incoherent when higher levels do not match the sum of the lower levels forecasts.
We tackle the vaccine sale forecasting problem by modeling sales data from GSK between 2010 and 2021 as a hierarchical time series.
arXiv Detail & Related papers (2023-05-02T14:34:34Z) - Development and Evaluation of Conformal Prediction Methods for QSAR [0.5161531917413706]
The quantitative structure-activity relationship (QSAR) regression model is a commonly used technique for predicting biological activities of compounds.
Most machine learning (ML) algorithms that achieve superior predictive performance require some add-on methods for estimating uncertainty of their prediction.
Conformal prediction (CP) is a promising approach. It is agnostic to the prediction algorithm and can produce valid prediction intervals under some weak assumptions on the data distribution.
arXiv Detail & Related papers (2023-04-03T13:41:09Z) - Generalization bounds and algorithms for estimating conditional average treatment effect of dosage [13.867315751451494]
We investigate the task of estimating the conditional average causal effect of treatment-dosage pairs from a combination of observational data and assumptions on the causal relationships in the underlying system.
This has been a longstanding challenge for fields of study such as epidemiology or economics that require a treatment-dosage pair to make decisions.
We show empirically new state-of-the-art performance results across several benchmark datasets for this problem.
arXiv Detail & Related papers (2022-05-29T15:26:59Z) - Insights into performance evaluation of compound-protein interaction prediction methods [0.0]
Machine learning based prediction of compound-protein interactions (CPIs) is important for drug design, screening and repurposing studies.
We have observed a number of fundamental issues in experiment design that lead to overly optimistic estimates of model performance.
arXiv Detail & Related papers (2022-01-28T20:07:19Z) - Towards a Rigorous Evaluation of Time-series Anomaly Detection [15.577148857778484]
In recent years, proposed studies on time-series anomaly detection (TAD) report high F1 scores on benchmark TAD datasets.
Most studies apply a peculiar evaluation protocol called point adjustment (PA) before scoring.
In this paper, we reveal that the PA protocol has a great possibility of overestimating the detection performance.
arXiv Detail & Related papers (2021-09-11T11:14:04Z) - Reenvisioning Collaborative Filtering vs Matrix Factorization [65.74881520196762]
Collaborative filtering models based on matrix factorization and learned similarities using Artificial Neural Networks (ANNs) have gained significant attention in recent years.
Announcement of ANNs within the recommendation ecosystem has been recently questioned, raising several comparisons in terms of efficiency and effectiveness.
We show the potential these techniques may have on beyond-accuracy evaluation while analyzing effect on complementary evaluation dimensions.
arXiv Detail & Related papers (2021-07-28T16:29:38Z) - Double Robust Representation Learning for Counterfactual Prediction [68.78210173955001]
We propose a novel scalable method to learn double-robust representations for counterfactual predictions.
We make robust and efficient counterfactual predictions for both individual and average treatment effects.
The algorithm shows competitive performance with the state-of-the-art on real world and synthetic data.
arXiv Detail & Related papers (2020-10-15T16:39:26Z) - A Survey on Causal Inference [64.45536158710014]
Causal inference is a critical research topic across many domains, such as statistics, computer science, education, public policy and economics.
Various causal effect estimation methods for observational data have sprung up.
arXiv Detail & Related papers (2020-02-05T21:35:29Z) - Overly Optimistic Prediction Results on Imbalanced Data: a Case Study of Flaws and Benefits when Applying Over-sampling [13.463035357173045]
We focus on one specific type of methodological flaw: applying over-sampling before partitioning the data into mutually exclusive training and testing sets.
We show how this causes the results to be biased using two artificial datasets and reproduce results of studies in which this flaw was identified.
arXiv Detail & Related papers (2020-01-15T12:53:23Z)