Beyond Correlations: A Downstream Evaluation Framework for Query Performance Prediction
- URL: http://arxiv.org/abs/2601.17339v1
- Date: Sat, 24 Jan 2026 06:58:30 GMT
- Title: Beyond Correlations: A Downstream Evaluation Framework for Query Performance Prediction
- Authors: Payel Santra, Partha Basuchowdhuri, Debasis Ganguly
- Abstract summary: The standard practice of query performance prediction (QPP) evaluation is to measure a set-level correlation between the estimated retrieval qualities and the true ones. We propose a downstream-focussed evaluation framework where a distribution of QPP estimates across a list of top documents retrieved with several rankers is used as priors for IR fusion. While on the one hand, a distribution of these estimates closely matching that of the true retrieval qualities indicates the quality of the predictor, their usage as priors on the other hand indicates a predictor's ability to make informed choices in an IR pipeline.
- Score: 10.378957672522157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The standard practice in query performance prediction (QPP) evaluation is to measure a set-level correlation between the estimated retrieval qualities and the true ones. However, this correlation-based evaluation neither quantifies QPP effectiveness at the level of individual queries nor connects it to a downstream application, meaning that QPP methods yielding high correlation values may find no practical use in query-specific decisions within an IR pipeline. In this paper, we propose a downstream-focussed evaluation framework in which a distribution of QPP estimates across the lists of top documents retrieved by several rankers is used as priors for IR fusion. On the one hand, a distribution of these estimates that closely matches that of the true retrieval qualities indicates the quality of the predictor; on the other hand, their usage as priors indicates the predictor's ability to make informed choices in an IR pipeline. Our experiments first establish the importance of QPP estimates in weighted IR fusion, yielding substantial improvements of more than 4.5% over unweighted CombSUM and RRF fusion strategies, and second, reveal that the downstream effectiveness of QPP does not correlate well with standard correlation-based QPP evaluation.
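To make the downstream setting concrete, the Python sketch below illustrates, under stated assumptions rather than as the authors' implementation, how per-ranker QPP estimates could serve as priors in weighted rank fusion. The helper names (min_max_normalise, combsum_weighted, rrf_weighted) and the toy runs and prior values are illustrative only; setting all priors to 1.0 recovers the unweighted CombSUM and RRF baselines the abstract compares against.

```python
# Minimal, illustrative sketch of QPP-weighted fusion (assumed interface, not the
# authors' code): each ranker's run is a {doc_id: score} map, and each ranker gets
# a QPP-derived prior that scales its contribution to the fused ranking.
from collections import defaultdict


def min_max_normalise(scores):
    """Normalise a {doc_id: score} map into [0, 1] so different rankers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def combsum_weighted(runs, priors):
    """Weighted CombSUM: sum normalised scores, scaling each ranker by its QPP prior."""
    fused = defaultdict(float)
    for run, prior in zip(runs, priors):
        for doc_id, score in min_max_normalise(run).items():
            fused[doc_id] += prior * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


def rrf_weighted(runs, priors, k=60):
    """Weighted reciprocal rank fusion: prior-scaled 1 / (k + rank) contributions."""
    fused = defaultdict(float)
    for run, prior in zip(runs, priors):
        ranked = sorted(run, key=run.get, reverse=True)
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] += prior / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical example: two rankers' runs for one query and their (normalised)
# QPP estimates used as fusion priors; priors of 1.0 recover unweighted fusion.
bm25_run = {"d1": 12.3, "d2": 9.8, "d3": 7.1}
dense_run = {"d2": 0.82, "d4": 0.79, "d1": 0.55}
qpp_priors = [0.35, 0.65]

print(combsum_weighted([bm25_run, dense_run], qpp_priors))
print(rrf_weighted([bm25_run, dense_run], qpp_priors))
```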
Related papers
- Predicting Retrieval Utility and Answer Quality in Retrieval-Augmented Generation [24.439170886636788]
A key challenge in improving RAG is to predict both the utility of retrieved documents and the quality of the final answers in terms of correctness and relevance. We define two prediction tasks within RAG: retrieval performance prediction and generation performance prediction. We argue that reader-centric features, such as the LLM's perplexity of the retrieved context conditioned on the input query, can further enhance prediction accuracy.
arXiv Detail & Related papers (2026-01-20T23:59:54Z)
- PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data [36.6443700664411]
Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. We propose two approaches to construct valid confidence intervals for OPE when using data augmentation.
arXiv Detail & Related papers (2025-07-26T21:51:15Z)
- Conformal Information Pursuit for Interactively Guiding Large Language Models [68.16703423481935]
This paper explores sequential querying strategies that aim to minimize the expected number of queries. One such strategy is Information Pursuit (IP), a greedy algorithm that at each iteration selects the query that maximizes information gain or, equivalently, minimizes uncertainty. We propose Conformal Information Pursuit (C-IP), an alternative approach to sequential information gain based on conformal prediction sets.
arXiv Detail & Related papers (2025-07-04T03:55:39Z)
- Combining Query Performance Predictors: A Reproducibility Study [6.681467202699048]
As early as 2009, Hauff et al. [28] explored whether different QPP methods can be combined to improve prediction quality. This study revisits Hauff et al.'s work to assess the extent to which their findings hold in the light of new prediction methods, evaluation metrics, and datasets.
arXiv Detail & Related papers (2025-03-31T16:01:58Z)
- Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. We propose a method called Stratified Prediction-Powered Inference (StratPPI). We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies.
arXiv Detail & Related papers (2024-06-06T17:37:39Z)
- Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a new query performance prediction (QPP) framework using automatically generated relevance judgments (QPP-GenRE). QPP-GenRE decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. We predict an item's relevance with open-source large language models (LLMs) to ensure scientific reproducibility.
arXiv Detail & Related papers (2024-04-01T09:33:05Z)
- Query Performance Prediction: From Ad-hoc to Conversational Search [55.37199498369387]
Query performance prediction (QPP) is a core task in information retrieval.
Research has shown the effectiveness and usefulness of QPP for ad-hoc search.
Despite its potential, QPP for conversational search has been little studied.
arXiv Detail & Related papers (2023-05-18T12:37:01Z)
- Towards Clear Expectations for Uncertainty Estimation [64.20262246029286]
Uncertainty Quantification (UQ) is crucial for achieving trustworthy Machine Learning (ML).
Most UQ methods suffer from disparate and inconsistent evaluation protocols.
This opinion paper offers a new perspective by specifying those requirements through five downstream tasks.
arXiv Detail & Related papers (2022-07-27T07:50:57Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.