Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges
- URL: http://arxiv.org/abs/2504.15205v1
- Date: Mon, 21 Apr 2025 16:20:43 GMT
- Title: Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges
- Authors: Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin
- Abstract summary: A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. We conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly.
- Score: 53.12387628636912
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than with the original human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
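The headline numbers above are exact-match agreement rates between human and GPT-4o support labels on a three-level scale. The sketch below is a minimal illustration, not the track's official evaluation code, of how such agreement (plus a chance-corrected statistic like Cohen's kappa) could be computed; the label encoding and sample data are hypothetical.

```python
from collections import Counter

# Hypothetical three-level support labels (0 = no support,
# 1 = partial support, 2 = full support); sample data is illustrative only.
human = [2, 1, 0, 2, 2, 1, 0, 1]
gpt4o = [2, 1, 1, 2, 0, 1, 0, 1]

def exact_match_rate(a, b):
    """Fraction of judgments where both judges assign exactly the same label."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b, labels=(0, 1, 2)):
    """Chance-corrected agreement between two judges labeling the same items."""
    n = len(a)
    p_observed = exact_match_rate(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    p_chance = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (p_observed - p_chance) / (1 - p_chance)

print(f"exact match: {exact_match_rate(human, gpt4o):.2%}")
print(f"Cohen's kappa: {cohens_kappa(human, gpt4o):.3f}")
```

Exact match is the statistic quoted in the abstract (56% / 72%); a kappa-style statistic additionally discounts agreement expected by chance on an imbalanced label distribution.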
Related papers
- The Viability of Crowdsourcing for RAG Evaluation [39.275627272019925]
We present the Crowd RAG Corpus 2025 (CrowdRAG-25), which consists of 903 human-written and 903 LLM-generated responses for the 301 topics of the TREC RAG'24 track.
Our analyses give insights into human writing behavior for RAG and the viability of crowdsourcing for RAG evaluation.
arXiv Detail & Related papers (2025-04-22T08:13:34Z)
- JudgeLRM: Large Reasoning Models as a Judge [65.14085339820795]
We investigate whether Large Language Model (LLM) judges truly benefit from enhanced reasoning capabilities. We introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards.
arXiv Detail & Related papers (2025-03-31T02:18:51Z)
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF). In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
- A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
- Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions [18.93335792080899]
We investigate how much influence prompting LLMs-as-a-judge has on the alignment of AI judgements with human judgements.
We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges.
arXiv Detail & Related papers (2024-08-16T14:49:35Z)
- Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Model (LLM) evaluator preferences with human evaluations.
We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer.
Our recalibration significantly improves the alignment of the LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
- Split and Merge: Aligning Position Biases in LLM-based Evaluators [22.265542509143756]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias. Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested. It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
arXiv Detail & Related papers (2023-09-29T14:38:58Z)
- Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
Large language models (LLMs) are claimed to be able to assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.