Redefining Retrieval Evaluation in the Era of LLMs
- URL: http://arxiv.org/abs/2510.21440v1
- Date: Fri, 24 Oct 2025 13:17:00 GMT
- Title: Redefining Retrieval Evaluation in the Era of LLMs
- Authors: Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, Fabrizio Silvestri
- Abstract summary: Traditional Information Retrieval (IR) metrics assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs). We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones.
- Score: 20.75884808285362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with the end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.
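To make the distinction concrete, here is a minimal sketch contrasting a classical nDCG computation with a UDCG-style score. This is illustrative only: the abstract does not spell out the exact UDCG formula, so the [-1, 1] utility scale (with negative values marking distracting passages) and the flat, LLM-oriented positional discount used below are assumptions, not the paper's definition.

```python
# Minimal illustrative sketch (not the paper's exact formulation): contrasts a
# classical nDCG-style discounted gain with a UDCG-style score that (a) uses
# graded utilities in [-1, 1], where negative values mark distracting passages,
# and (b) applies a flat, LLM-oriented positional discount. The [-1, 1] scale
# and the flat discount are assumptions made for illustration only.
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int) -> float:
    """Classical DCG: gains are discounted by log2(rank + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg(relevances: Sequence[float], k: int) -> float:
    """nDCG: DCG normalized by the ideal (sorted) ranking's DCG."""
    ideal_dcg = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def udcg_like(utilities: Sequence[float], k: int) -> float:
    """UDCG-style score (illustrative): sum utilities in [-1, 1] with a flat
    positional discount, so distractors (utility < 0) subtract from the score
    regardless of rank. Normalized by the best achievable positive sum."""
    score = sum(utilities[:k])  # flat discount: every passage counts equally
    ideal = sum(u for u in sorted(utilities, reverse=True)[:k] if u > 0)
    return score / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Rank 1: helpful passage, rank 2: distractor, rank 3: mildly helpful.
    utilities = [1.0, -0.6, 0.4]
    # Binary relevance as a classical metric would see it (distractor = 0).
    relevances = [1.0, 0.0, 1.0]
    print(f"nDCG@3      = {ndcg(relevances, 3):.3f}")
    print(f"UDCG-like@3 = {udcg_like(utilities, 3):.3f}")
```

In the toy example, the rank-2 distractor is merely ignored by nDCG (it contributes a gain of zero) but actively penalized by the UDCG-style score, which is the behavioral gap the proposed metric is meant to capture.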
Related papers
- Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevant Assessment for IR Benchmarks [31.017987800426894]
DREAM is a multi-round debate-based relevance assessment framework with LLM agents. It achieves 95.2% labeling accuracy with only 3.5% human involvement. BRIDGE is a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison.
arXiv Detail & Related papers (2026-02-06T09:27:03Z) - LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews [2.2175470459999636]
We identify problems with the use of traditional metrics for assessing the performance of Gen-AI tools for identifying relevant literature in systematic reviews. Major weaknesses include a failure to use metrics that are robust to imbalanced data and that directly indicate whether results are better than chance. On the positive side, we extract good evaluation practices on which our recommendations for researchers, practitioners, and policymakers are built.
arXiv Detail & Related papers (2025-11-16T15:04:50Z) - LLM Unlearning using Gradient Ratio-Based Influence Estimation and Noise Injection [0.0]
Existing empirical methods often yield incomplete forgetting or unintended degradation of unrelated knowledge due to poor localization. GRIN introduces a novel gradient-ratio-based metric to identify parameters most responsible for memorizing forget data. We then perform selective noise injection into these parameters prior to fine-tuning, which improves unlearning performance while maintaining model utility.
arXiv Detail & Related papers (2025-08-08T17:15:32Z) - Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z) - Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation [96.18720164390699]
This paper explores the use of large language models (LLMs) for annotating document utility in training retrieval and retrieval-augmented generation (RAG) systems. Our results show that LLM-generated annotations enhance out-of-domain retrieval performance and improve RAG outcomes compared to models trained solely on human annotations or downstream QA metrics.
arXiv Detail & Related papers (2025-04-07T16:05:52Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z) - Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards [5.632231145349045]
This paper investigates the transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP.
Existing relation extraction benchmarks often suffer from insufficient documentation and lack crucial details.
While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well.
arXiv Detail & Related papers (2024-11-07T22:36:19Z) - LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation [28.61326111959728]
Large language models (LLMs) exhibit significant biases in evaluation tasks, particularly in preferentially rating and favoring self-generated content.
Our study addresses this knowledge gap by simulating two critical phases of the retrieval-augmented generation (RAG) framework.
Contrary to previous findings, our results reveal no significant self-preference effect in RAG frameworks.
arXiv Detail & Related papers (2024-10-28T08:32:09Z) - Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I [39.92942310783174]
Large language models (LLMs) can generate relevance annotations at an enormous scale with relatively small computational costs.
We propose two methods based on prediction-powered inference and conformal risk control.
Our experimental results show that our CIs accurately capture both the variance and bias in evaluation.
arXiv Detail & Related papers (2024-07-02T17:44:00Z) - FIRST: Faster Improved Listwise Reranking with Single Token Decoding [56.727761901751194]
We introduce FIRST, a novel listwise LLM reranking approach leveraging the output logits of the first generated identifier to directly obtain a ranked ordering of the candidates.
Empirical results demonstrate that FIRST accelerates inference by 50% while maintaining a robust ranking performance with gains across the BEIR benchmark.
Our results show that LLM rerankers can provide a stronger distillation signal compared to cross-encoders, yielding substantial improvements in retriever recall after relevance feedback.
arXiv Detail & Related papers (2024-06-21T21:27:50Z) - A Theory for Token-Level Harmonization in Retrieval-Augmented Generation [76.75124161306795]
Retrieval-augmented generation (RAG) utilizes retrieved texts to enhance large language models (LLMs). This paper provides a theory to explain and trade off the benefit and detriment in RAG. Based on our theory, we propose a practical novel method, Tok-RAG, which achieves collaborative generation between the pure LLM and RAG.
arXiv Detail & Related papers (2024-06-03T02:56:14Z) - Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a new query performance prediction (QPP) framework using automatically generated relevance judgments (QPP-GenRE). QPP-GenRE decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility.
arXiv Detail & Related papers (2024-04-01T09:33:05Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)