PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation
- URL: http://arxiv.org/abs/2601.18777v1
- Date: Mon, 26 Jan 2026 18:46:49 GMT
- Title: PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation
- Authors: Abhishek Divekar, Anirban Majumder
- Abstract summary: Prediction-Powered Inference (PPI) combines minimal human annotations with Large Language Models (LLMs) to produce reliable estimates of metrics. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly.
- Score: 3.867363075280545
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Evaluating the quality of search, ranking and RAG systems traditionally requires a significant number of human relevance annotations. Recently, several deployed systems have explored using Large Language Models (LLMs) as automated judges for this task, although their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics that require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift for an LLM-based query reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduce the computational complexity from O(2^|C|) to O(2^K), where |C| is the corpus size (on the order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric, while effectively correcting for LLM bias in low-resource settings.
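PRECISE's sub-instance extension is specific to the paper, but the classical PPI point estimate it builds on is standard: average the LLM-judged metric over the large unlabeled set, then add a rectifier equal to the mean human-minus-LLM gap on the small annotated set. A minimal sketch follows; the function name, the per-query Precision@K uplift values, and the bias magnitudes are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ppi_mean(llm_unlabeled, llm_labeled, human_labeled):
    """Prediction-powered point estimate of a mean metric.

    llm_unlabeled : LLM-derived metric values on N unlabeled queries.
    llm_labeled   : LLM-derived metric values on n human-annotated queries.
    human_labeled : human-derived metric values on the same n queries.
    """
    # Large-sample LLM estimate, rectified by the mean LLM-vs-human gap.
    rectifier = human_labeled - llm_labeled
    estimate = llm_unlabeled.mean() + rectifier.mean()
    # Standard error treats the two samples as independent.
    var = (llm_unlabeled.var(ddof=1) / len(llm_unlabeled)
           + rectifier.var(ddof=1) / len(rectifier))
    return estimate, np.sqrt(var)

# Illustration: per-query Precision@K uplift (treatment minus control),
# LLM-judged on 10,000 queries, human-annotated on 100.
rng = np.random.default_rng(0)
llm_big = rng.normal(0.05, 0.10, 10_000)                    # biased LLM judgments
llm_small = rng.normal(0.05, 0.10, 100)
human_small = llm_small - 0.02 + rng.normal(0, 0.05, 100)   # humans expose a bias
est, se = ppi_mean(llm_big, llm_small, human_small)
print(f"PPI uplift estimate: {est:.3f} +/- {1.96 * se:.3f}")
```

The rectifier is what corrects the LLM's systematic bias: with LLM judgments alone the estimate would sit near 0.05, while PPI pulls it toward the human-consistent 0.03.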
Related papers
- Context-Adaptive Requirements Defect Prediction through Human-LLM Collaboration [1.4499356176178066]
We propose a Human-LLM Collaboration (HLC) approach that treats defect prediction as an adaptive process rather than a static classification task. We evaluate this approach using the weak word smell on the QuRE benchmark of 1,266 annotated Mercedes-Benz requirements.
arXiv Detail & Related papers (2026-01-05T10:00:14Z) - Benchmarking Debiasing Methods for LLM-based Parameter Estimates [7.935918489638212]
Large language models (LLMs) offer an inexpensive yet powerful way to annotate text, but are often inconsistent when compared with experts. To mitigate this bias, researchers have developed debiasing methods such as Design-based Supervised Learning (DSL) and Prediction-Powered Inference (PPI). We compare DSL and PPI across a range of tasks, finding that although both achieve low bias with large datasets, DSL often outperforms PPI on bias reduction and empirical efficiency, but its performance is less consistent across datasets.
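For a simple mean, the DSL-style correction can be sketched in the same spirit as the PPI snippet above; the code below is a minimal illustration assuming expert labels are drawn with a known sampling probability pi, not the full DSL estimator, which feeds such bias-corrected pseudo-outcomes into arbitrary downstream estimators.

```python
import numpy as np

def dsl_mean(llm_all, human, labeled_mask, pi):
    """Design-based (DSL-style) mean estimate from LLM annotations.

    llm_all      : LLM annotations for all N documents.
    human        : human labels aligned with llm_all; only positions where
                   labeled_mask is True are used.
    labeled_mask : True for documents sampled for expert coding.
    pi           : known probability that a document was expert-coded.
    """
    # Doubly-robust-style pseudo-outcome: correct the LLM annotation on the
    # expert-coded subsample, inverse-weighted by the sampling probability.
    correction = np.where(labeled_mask, (human - llm_all) / pi, 0.0)
    return float((llm_all + correction).mean())
```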
arXiv Detail & Related papers (2025-06-11T11:37:02Z) - AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking [35.685682379377134]
Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance. Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines.
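The abstract does not spell out AcuRank's update rule, so the following is only a toy sketch of the general pattern it describes (spend extra LLM calls where ranking uncertainty is highest, stop once beliefs are confident). Here llm_rank_window, the update constants, and the stopping threshold are all invented placeholders, not the authors' algorithm.

```python
def adaptive_rerank(doc_ids, llm_rank_window, rounds=3, window=10, tau=0.05):
    """Uncertainty-aware adaptive reranking sketch.

    Maintains a (mu, sigma) relevance belief per document and targets
    listwise LLM calls at the currently most uncertain documents.
    llm_rank_window(ids) -> the same ids, ordered best-first by the LLM.
    """
    mu = {d: 0.0 for d in doc_ids}
    sigma = {d: 1.0 for d in doc_ids}
    for _ in range(rounds):
        # Re-judge the window of documents we are least certain about.
        uncertain = sorted(doc_ids, key=lambda d: -sigma[d])[:window]
        ordered = llm_rank_window(uncertain)
        for rank, d in enumerate(ordered):
            observed = 1.0 - rank / max(len(ordered) - 1, 1)
            mu[d] = 0.5 * mu[d] + 0.5 * observed  # shrink toward observation
            sigma[d] *= 0.5                        # confidence grows per call
        if max(sigma.values()) < tau:              # confident everywhere: stop
            break
    return sorted(doc_ids, key=lambda d: -mu[d])
```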
arXiv Detail & Related papers (2025-05-24T05:15:49Z) - Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation [96.18720164390699]
This paper explores the use of large language models (LLMs) for annotating document utility in training retrieval and retrieval-augmented generation (RAG) systems. Our results show that LLM-generated annotations enhance out-of-domain retrieval performance and improve RAG outcomes compared to models trained solely on human annotations or downstream QA metrics.
arXiv Detail & Related papers (2025-04-07T16:05:52Z) - Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach.
This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets.
We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
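The sliding-window strategy this paradigm depends on is standard in LLM listwise reranking and is worth making concrete; in the sketch below, llm_rank_window is a placeholder for a listwise model call that can only rank a small list at a time.

```python
def sliding_window_rerank(docs, llm_rank_window, window=20, stride=10):
    """Sliding-window listwise reranking (the paradigm being improved on).

    The candidate list is swept back-to-front in overlapping windows so
    that strong documents can bubble up toward the top across calls.
    llm_rank_window(window_docs) -> the same docs, ordered best-first.
    """
    docs = list(docs)
    start = max(len(docs) - window, 0)
    while True:
        docs[start:start + window] = llm_rank_window(docs[start:start + window])
        if start == 0:
            break
        start = max(start - stride, 0)
    return docs
```

The self-calibrated method summarized above avoids this iterative sweep by eliciting global relevance scores instead, so candidates need not be compared window by window.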
arXiv Detail & Related papers (2024-11-07T10:31:31Z) - Improved Diversity-Promoting Collaborative Metric Learning for Recommendation [127.08043409083687]
Collaborative Metric Learning (CML) has recently emerged as a popular method in recommendation systems.
This paper focuses on a challenging scenario where a user has multiple categories of interests.
We propose a novel method called Diversity-Promoting Collaborative Metric Learning (DPCML).
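Reading the abstract, the multi-interest idea amounts to representing each user as a set of embedding vectors rather than a single point. The toy sketch below shows two ingredients such a method plausibly needs: scoring an item against the closest interest vector, and a penalty that keeps interest vectors spread apart. Everything here is illustrative; the actual DPCML objective is defined in the paper.

```python
import numpy as np

def multi_interest_score(user_vecs, item_vec):
    """CML-style score where a user is a *set* of interest embeddings:
    an item matches if it is close to any one interest vector."""
    dists = np.linalg.norm(user_vecs - item_vec, axis=1)
    return -float(dists.min())  # higher is better

def diversity_penalty(user_vecs, margin=1.0):
    """Hinge penalty encouraging interest vectors to stay spread out."""
    pen = 0.0
    for i in range(len(user_vecs)):
        for j in range(i + 1, len(user_vecs)):
            gap = np.linalg.norm(user_vecs[i] - user_vecs[j])
            pen += max(0.0, margin - gap)
    return pen
```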
arXiv Detail & Related papers (2024-09-02T07:44:48Z) - FineSurE: Fine-grained Summarization Evaluation using LLMs [22.62504593575933]
FineSurE is a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs).
It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment.
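The abstract implies each dimension is aggregated from fine-grained LLM verdicts; one plausible percentage-style aggregation is sketched below. The three verdict lists and the conciseness definition are assumptions for illustration; the actual prompts and definitions are the paper's.

```python
def finegrained_scores(sent_is_faithful, fact_is_covered, sent_is_needed):
    """Percentage-style summary scores from per-unit LLM verdicts.

    sent_is_faithful : one faithfulness verdict (bool) per summary sentence.
    fact_is_covered  : one coverage verdict (bool) per reference key fact.
    sent_is_needed   : whether each summary sentence maps to some key fact,
                       penalizing padding (conciseness).
    """
    def pct(verdicts):
        return sum(verdicts) / max(len(verdicts), 1)

    return {
        "faithfulness": pct(sent_is_faithful),
        "completeness": pct(fact_is_covered),
        "conciseness": pct(sent_is_needed),
    }
```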
arXiv Detail & Related papers (2024-07-01T02:20:28Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
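If text quality really is embedded in LLM representations, a lightweight probe can recover it; the sketch below uses random vectors as stand-ins for real hidden states and a logistic probe as a stand-in for whatever projection RepEval actually learns, so treat it as a schematic, not the paper's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins: final-layer LLM states for texts with known quality labels.
train_reps = np.random.randn(200, 4096)
train_quality = np.random.randint(0, 2, 200)

probe = LogisticRegression(max_iter=1000).fit(train_reps, train_quality)

def representation_score(rep):
    """Project one LLM representation onto the learned quality direction."""
    return float(probe.predict_proba(rep.reshape(1, -1))[0, 1])
```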
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a new query performance prediction (QPP) framework using automatically generated relevance judgments (QPP-GenRE). QPP-GenRE decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility.
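Once per-item relevance is predicted, any judgment-based IR measure follows mechanically; a minimal sketch for precision@k is below, with llm_judge_relevant standing in for a single open-source-LLM call on one (query, document) pair.

```python
def predicted_precision_at_k(ranked_docs, llm_judge_relevant, k=10):
    """Query performance prediction from generated relevance judgments.

    llm_judge_relevant(doc) -> bool, an LLM verdict for one ranked item.
    """
    top_k = ranked_docs[:k]
    judged = [llm_judge_relevant(doc) for doc in top_k]
    return sum(judged) / max(len(top_k), 1)
```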
arXiv Detail & Related papers (2024-04-01T09:33:05Z) - Regression-aware Inference with LLMs [52.764328080398805]
We show that standard decoding-based inference strategies can be sub-optimal for common regression and scoring evaluation metrics.
We propose alternate inference strategies that estimate the Bayes-optimal solution for regression and scoring metrics in closed-form from sampled responses.
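The closed forms involved are standard decision theory: under squared error the Bayes-optimal point prediction is the mean of the predictive distribution, and under absolute error it is the median, with sampled responses serving as a Monte Carlo approximation of that distribution. A minimal sketch, with the loss names chosen here for illustration:

```python
import statistics

def bayes_optimal_point(samples, loss="squared_error"):
    """Closed-form Bayes-optimal point estimate from sampled responses."""
    if loss == "squared_error":
        return statistics.fmean(samples)   # posterior mean minimizes MSE
    if loss == "absolute_error":
        return statistics.median(samples)  # posterior median minimizes MAE
    raise ValueError(f"unsupported loss: {loss}")

# Five sampled scores from an LLM judge for the same item.
samples = [3.0, 4.0, 4.0, 5.0, 9.0]
print(bayes_optimal_point(samples))                    # 5.0 (mean)
print(bayes_optimal_point(samples, "absolute_error"))  # 4.0 (median)
```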
arXiv Detail & Related papers (2024-03-07T03:24:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.