Related papers: Query Performance Prediction using Relevance Judgments Generated by Large Language Models

URL: http://arxiv.org/abs/2404.01012v2
Date: Mon, 17 Jun 2024 11:23:20 GMT
Title: Query Performance Prediction using Relevance Judgments Generated by Large Language Models
Authors: Chuan Meng, Negar Arabzadeh, Arian Askari, Mohammad Aliannejadi, Maarten de Rijke
Abstract summary: We propose a QPP framework using automatically generated relevance judgments (QPP-GenRE). QPP-GenRE decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels.
Score: 53.97064615557883
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility. We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019-2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.
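The decomposition described in the abstract can be illustrated with a minimal sketch: judge each item in the top of a ranked list, then compute IR measures from those judgments as pseudo-labels. This is not the authors' implementation. The names `predict_measures` and `toy_judge` are hypothetical, and the term-overlap judge is only a stand-in for the fine-tuned open-source LLM the paper uses.

```python
from typing import Callable, Dict, List


def precision_at_k(labels: List[int], k: int) -> float:
    """Fraction of the top-k positions judged relevant."""
    return sum(labels[:k]) / k if k > 0 else 0.0


def reciprocal_rank_at_k(labels: List[int], k: int) -> float:
    """1 / rank of the first item judged relevant within the top k, else 0."""
    for rank, rel in enumerate(labels[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def predict_measures(
    query: str,
    ranked_docs: List[str],
    judge: Callable[[str, str], int],
    k: int,
) -> Dict[str, float]:
    """Judge each of the top-k items independently, then score the list.

    `judge` is a hypothetical callable returning 1 (relevant) or 0.
    """
    labels = [judge(query, doc) for doc in ranked_docs[:k]]
    return {
        "precision@k": precision_at_k(labels, k),
        "rr@k": reciprocal_rank_at_k(labels, k),
    }


def toy_judge(query: str, doc: str) -> int:
    """Assumed stand-in for an LLM judge: relevant if any query term appears."""
    doc_terms = set(doc.lower().split())
    return int(any(term in doc_terms for term in query.lower().split()))


ranked = [
    "weather forecast today",
    "predicting query performance with LLMs",
    "a survey of performance prediction",
    "cooking pasta",
]
predicted = predict_measures("query performance prediction", ranked, toy_judge, k=4)
print(predicted)
```

Because the per-item judgments are kept, a predicted value can be traced back to the individual labels that produced it, which is the interpretability benefit the abstract points to.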

Related papers

Conformal Information Pursuit for Interactively Guiding Large Language Models [64.39770942422288]
This paper explores sequential querying strategies that aim to minimize the expected number of queries. One such strategy is Information Pursuit (IP), a greedy algorithm that at each iteration selects the query that maximizes information gain or equivalently minimizes uncertainty. We propose Conformal Information Pursuit (C-IP), an alternative approach to sequential information gain based on conformal prediction sets.
arXiv Detail & Related papers (2025-07-04T03:55:39Z)
Conformal Prediction Beyond the Seen: A Missing Mass Perspective for Uncertainty Quantification in Generative Models [20.810300785340072]
Conformal Prediction with Query Oracle (CPQ) is a framework characterizing the optimal interplay between these objectives. Our algorithm is built on two core principles: one governs the optimal query policy, and the other defines the optimal mapping from queried samples to prediction sets.
arXiv Detail & Related papers (2025-06-05T18:26:14Z)
Uncovering the Limitations of Query Performance Prediction: Failures, Insights, and Implications for Selective Query Processing [3.463527836552468]
This paper provides a comprehensive evaluation of state-of-the-art QPPs (e.g. NQC, UQC). We use diverse sparse rankers (BM25, DFree without and with query expansion), hybrid or dense rankers (SPLADE and ColBERT), and diverse test collections: ROBUST, GOV2, WT10G, and MS MARCO. Results show significant variability in predictor accuracy, with collections as the main factor and rankers next.
arXiv Detail & Related papers (2025-04-01T18:18:21Z)
PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
Addressing Uncertainty in LLMs to Enhance Reliability in Generative AI [47.64301863399763]
We present a dynamic semantic clustering approach inspired by the Chinese Restaurant Process. We quantify the uncertainty of Large Language Models (LLMs) on a given query by calculating the entropy of the generated semantic clusters. We propose leveraging the (negative) likelihood of these clusters as the (non)conformity score within the Conformal Prediction framework.
arXiv Detail & Related papers (2024-11-04T18:49:46Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Mitigating LLM Hallucinations via Conformal Abstention [70.83870602967625]
We develop a principled procedure for determining when a large language model should abstain from responding in a general domain. We leverage conformal prediction techniques to develop an abstention procedure that benefits from rigorous theoretical guarantees on the hallucination rate (error rate). Experimentally, our resulting conformal abstention method reliably bounds the hallucination rate on various closed-book, open-domain generative question answering datasets.
arXiv Detail & Related papers (2024-04-04T11:32:03Z)
Query Performance Prediction: From Ad-hoc to Conversational Search [55.37199498369387]
Query performance prediction (QPP) is a core task in information retrieval. Research has shown the effectiveness and usefulness of QPP for ad-hoc search. Despite its potential, QPP for conversational search has been little studied.
arXiv Detail & Related papers (2023-05-18T12:37:01Z)
iQPP: A Benchmark for Image Query Performance Prediction [24.573869540845124]
We propose the first benchmark for image query performance prediction (iQPP). We estimate the ground-truth difficulty of each query as the average precision or the precision@k, using two state-of-the-art image retrieval models. Next, we propose and evaluate novel pre-retrieval and post-retrieval query performance predictors, comparing them with existing or adapted (from text to image) predictors. Our comprehensive experiments indicate that iQPP is a challenging benchmark, revealing an important research gap that needs to be addressed in future work.
arXiv Detail & Related papers (2023-02-20T17:56:57Z)
Uncertainty Quantification with Pre-trained Language Models: A Large-Scale Empirical Analysis [120.9545643534454]
It is crucial for the pipeline to minimize the calibration error, especially in safety-critical applications. There are various considerations behind the pipeline: (1) the choice and (2) the size of PLM, (3) the choice of uncertainty quantifier, (4) the choice of fine-tuning loss, and many more. In response, we recommend the following: (1) use ELECTRA for PLM encoding, (2) use larger PLMs if possible, (3) use Temp Scaling as the uncertainty quantifier, and (4) use Focal Loss for fine-tuning.
arXiv Detail & Related papers (2022-10-10T14:16:01Z)
Towards Clear Expectations for Uncertainty Estimation [64.20262246029286]
Uncertainty Quantification (UQ) is crucial to achieve trustworthy Machine Learning (ML). Most UQ methods suffer from disparate and inconsistent evaluation protocols. This opinion paper offers a new perspective by specifying those requirements through five downstream tasks.
arXiv Detail & Related papers (2022-07-27T07:50:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.