Related papers: A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look

URL: http://arxiv.org/abs/2411.08275v1
Date: Wed, 13 Nov 2024 01:12:35 GMT
Title: A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look
Authors: Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, Jimmy Lin
Abstract summary: This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed. We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
Score: 52.114284476700874
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The application of large language models to provide relevance assessments presents exciting opportunities to advance information retrieval, natural language processing, and beyond, but to date many unknowns remain. This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the "standard" fully manual process that NIST has implemented for decades and three different alternatives that take advantage of LLMs to different extents using the open-source UMBRELA tool. This setup allows us to correlate system rankings induced by the different approaches to characterize tradeoffs between cost and quality. We find that in terms of nDCG@20, nDCG@100, and Recall@100, system rankings induced by automatically generated relevance assessments from UMBRELA correlate highly with those induced by fully manual assessments across a diverse set of 77 runs from 19 teams. Our results suggest that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits. Overall, human assessors appear to be stricter than UMBRELA in applying relevance criteria. Our work validates the use of LLMs in academic TREC-style evaluations and provides the foundation for future studies.
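The abstract's core measurement, correlating the system rankings induced by two sets of relevance judgments, can be illustrated with a rank correlation coefficient. The sketch below uses Kendall's tau as one common choice; the abstract does not name the coefficient, and the run names and nDCG@20 scores here are hypothetical values made up for illustration, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {run_name: score} dicts.

    A pair of runs is concordant when both score sets order the two runs
    the same way, and discordant when they order them oppositely.
    Tied pairs count as neither.
    """
    runs = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for r1, r2 in combinations(runs, 2):
        product = (scores_a[r1] - scores_a[r2]) * (scores_b[r1] - scores_b[r2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(runs) * (len(runs) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical per-run nDCG@20 under manual and automatic judgments.
manual = {"run_a": 0.62, "run_b": 0.55, "run_c": 0.48, "run_d": 0.40}
automatic = {"run_a": 0.70, "run_b": 0.66, "run_c": 0.68, "run_d": 0.51}

tau = kendall_tau(manual, automatic)
print(f"Kendall's tau between the two system rankings: {tau:.3f}")
```

A value near 1 means the two judgment sets rank the runs almost identically; in this toy data only one pair of runs (run_b and run_c) swaps order.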

Related papers

Skewed Score: A statistical framework to assess autograders [2.9645858732618238]
"LLM-as-a-judge", or autograders, offer a scalable alternative to human evaluation. They have shown mixed reliability and may exhibit systematic biases. We propose a statistical framework that enables researchers to simultaneously assess their autograders.
arXiv Detail & Related papers (2025-07-04T18:45:10Z)
Benchmarking LLM-based Relevance Judgment Methods [15.255877686845773]
Large Language Models (LLMs) are increasingly deployed in both academic and industry settings. We systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model.
arXiv Detail & Related papers (2025-04-17T01:13:21Z)
Judging the Judges: A Collection of LLM-Generated Relevance Judgements [37.103230004631996]
This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024. We release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams.
arXiv Detail & Related papers (2025-02-19T17:40:32Z)
Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation [2.9180406633632523]
Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment. Recent research has shown that LLM-based assessments yield high system-ranking correlation with human-made judgements. We look at how LLM-generated judgements preserve ranking differences among top-performing systems and whether they yield the same pairwise significance outcomes as human judgements.
arXiv Detail & Related papers (2024-11-20T11:19:35Z)
Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach. This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets. We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
arXiv Detail & Related papers (2024-11-07T10:31:31Z)
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z)
Evaluation of Instruction-Following Ability for Large Language Models on Story-Ending Generation [2.4889060833127665]
In this paper, we focus on evaluating the instruction-following ability of Large Language Models (LLMs) in the context of story-ending generation. We propose an automatic evaluation pipeline that utilizes a machine reading comprehension (MRC) model to determine whether the generated story-ending reflects instruction.
arXiv Detail & Related papers (2024-06-24T06:53:36Z)
UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor [51.20527342770299]
UMBRELA is an open-source toolkit that reproduces the results of Thomas et al. using OpenAI's GPT-4o model. Our toolkit is designed to be easily extensible and can be integrated into existing multi-stage retrieval and evaluation pipelines. UMBRELA will be used in the TREC 2024 RAG Track to aid in relevance assessments.
arXiv Detail & Related papers (2024-06-10T17:58:29Z)
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
Benchmarking Cognitive Biases in Large Language Models as Evaluators [16.845939677403287]
Large Language Models (LLMs) have been shown to be effective as automatic evaluators with simple prompting and in-context learning. We evaluate the quality of ranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators. We find that LLMs are biased text quality evaluators, exhibiting strong indications on our bias benchmark.
arXiv Detail & Related papers (2023-09-29T06:53:10Z)
L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs). We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs. Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
Style Over Substance: Evaluation Biases for Large Language Models [17.13064447978519]
This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs). Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors. We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
arXiv Detail & Related papers (2023-07-06T14:42:01Z)
Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization. We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.