Benchmarking LLM-based Relevance Judgment Methods
- URL: http://arxiv.org/abs/2504.12558v1
- Date: Thu, 17 Apr 2025 01:13:21 GMT
- Title: Benchmarking LLM-based Relevance Judgment Methods
- Authors: Negar Arabzadeh, Charles L. A. Clarke
- Abstract summary: Large Language Models (LLMs) are increasingly deployed in both academic and industry settings. We systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model.
- Score: 15.255877686845773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are increasingly deployed in both academic and industry settings to automate the evaluation of information seeking systems, particularly by generating graded relevance judgments. Previous work on LLM-based relevance assessment has primarily focused on replicating graded human relevance judgments through various prompting strategies. However, there has been limited exploration of alternative assessment methods or comprehensive comparative studies. In this paper, we systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods (document-agnostic and document-dependent). In addition to a traditional comparison based on system rankings using Kendall correlations, we also examine how well LLM judgments align with human preferences, as inferred from relevance grades. We conduct extensive experiments on datasets from the TREC Deep Learning tracks of 2019, 2020, and 2021, as well as the ANTIQUE dataset, which focuses on non-factoid open-domain question answering. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model. Our goal is to reproduce various LLM-based relevance judgment methods to provide a comprehensive comparison. All code, data, and resources are publicly available in our GitHub repository at https://github.com/Narabzad/llm-relevance-judgement-comparison.
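As a minimal sketch of the system-ranking comparison the abstract describes, the snippet below scores each retrieval run against human qrels and against LLM-generated qrels and reports Kendall's tau between the two induced system orderings. The metric and data structures are illustrative assumptions, not the paper's released code.
```python
# Illustrative sketch (not the paper's released code): compare system rankings
# induced by human qrels vs. LLM-generated qrels using Kendall's tau.
from scipy.stats import kendalltau


def precision_at_k(run, qrels, k=10):
    """Simple per-run effectiveness metric: mean fraction of relevant docs in the top k."""
    per_query = []
    for query_id, ranked_docs in run.items():
        relevant = {d for d, grade in qrels.get(query_id, {}).items() if grade > 0}
        per_query.append(sum(1 for d in ranked_docs[:k] if d in relevant) / k)
    return sum(per_query) / len(per_query) if per_query else 0.0


def ranking_correlation(runs, human_qrels, llm_qrels, metric=precision_at_k):
    """runs: dict of system name -> {query_id: [ranked doc ids]}."""
    names = sorted(runs)
    human_scores = [metric(runs[n], human_qrels) for n in names]
    llm_scores = [metric(runs[n], llm_qrels) for n in names]
    tau, p_value = kendalltau(human_scores, llm_scores)
    return tau, p_value
```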
Related papers
- Validating LLM-Generated Relevance Labels for Educational Resource Search [2.2175950967382487]
We release a dataset of 401 human relevance judgements from a user study involving teaching professionals performing search tasks related to lesson planning.
Using domain-specific frameworks, LLMs achieved strong agreement with human judgements.
System-level evaluation showed that LLM judgements reliably identified top-performing retrieval approaches.
arXiv Detail & Related papers (2025-04-17T08:14:45Z)
- Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
- Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation [2.9180406633632523]
Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment. Recent research has shown that LLM-based assessments yield high system-ranking correlation with human-made judgements. We look at how well LLM-generated judgements preserve ranking differences among top-performing systems, and how well they preserve pairwise significance evaluations relative to human judgements.
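As a rough illustration of the pairwise-significance check mentioned above (an assumed setup, not this paper's code), one can test, for every pair of systems, whether a paired significance test reaches the same verdict under human qrels and under LLM qrels:
```python
# Hedged sketch: for each pair of systems, check whether the per-query score
# difference is significant under human judgments and under LLM judgments,
# and count how often the two verdicts agree.
from itertools import combinations
from scipy.stats import ttest_rel


def significance_agreement(scores_human, scores_llm, alpha=0.05):
    """scores_*: dict mapping system name -> list of per-query scores
    (same query order for every system)."""
    systems = sorted(scores_human)
    agree, total = 0, 0
    for a, b in combinations(systems, 2):
        _, p_human = ttest_rel(scores_human[a], scores_human[b])
        _, p_llm = ttest_rel(scores_llm[a], scores_llm[b])
        agree += (p_human < alpha) == (p_llm < alpha)
        total += 1
    return agree / total if total else float("nan")
```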
arXiv Detail & Related papers (2024-11-20T11:19:35Z)
- Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat [7.8905223445925055]
Pairwise ranking has emerged as a new method for evaluating human preferences for large language models (LLMs). We explore the effectiveness of ranking systems for head-to-head comparisons of LLMs. Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency.
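The summary does not name a specific ranking system; as one hedged example of the kind of head-to-head rating scheme such a study might consider, here is a standard Elo update (purely illustrative, not necessarily the paper's recipe):
```python
# Illustrative Elo update for one head-to-head comparison between models A and B.
def elo_update(rating_a, rating_b, winner, k=32):
    """winner: 'a', 'b', or 'tie'."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b
```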
arXiv Detail & Related papers (2024-11-19T20:16:26Z)
- A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
- Best in Tau@LLMJudge: Criteria-Based Relevance Evaluation with Llama3 [5.478764356647438]
We explore alternative methods to prompt large language models (LLMs) for assigning relevance labels.
We consider various ways to aggregate criteria-level grades into a relevance label.
We include an empirical evaluation of our approaches based on data from the LLMJudge challenge run in Summer 2024.
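As a loose sketch of the criteria-aggregation idea (criterion names, weights, and the 0-3 label scale here are assumptions for illustration, not the paper's exact recipe):
```python
# Hedged sketch: grade a document on several relevance criteria, then map the
# weighted sum of criterion grades onto a single graded relevance label.
def aggregate_criteria(grades, weights=None, max_grade=3, max_label=3):
    """grades: dict of criterion name -> integer grade in [0, max_grade]."""
    weights = weights or {c: 1.0 for c in grades}
    weighted = sum(weights[c] * g for c, g in grades.items())
    total = sum(weights[c] * max_grade for c in grades)
    # Rescale the weighted score onto the final relevance label range.
    return round(max_label * weighted / total) if total else 0


# Example with purely illustrative criterion names.
label = aggregate_criteria({"exactness": 2, "topicality": 3, "coverage": 1, "fit": 2})
```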
arXiv Detail & Related papers (2024-10-17T21:37:08Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language.
However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.
We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally.
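A minimal sketch of the pairwise-comparison building block is shown below; the uncertainty-guided search that distinguishes PAIRS is not reproduced here, and `llm_prefers` is a hypothetical stand-in for a judge-model call.
```python
# Hedged sketch: rank candidate texts by using an LLM's pairwise preferences
# as the comparator. The LLM call itself is left as a placeholder.
from functools import cmp_to_key


def llm_prefers(text_a, text_b):
    """Return 1 if the judge model prefers text_a over text_b, else -1."""
    raise NotImplementedError  # back this with a prompt to your own judge model


def rank_candidates(candidates):
    # Preferred candidates sort toward the front of the list.
    return sorted(candidates, key=cmp_to_key(lambda a, b: -llm_prefers(a, b)))
```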
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations [10.709365940160685]
Modern large language models (LLMs) are hard to evaluate and compare automatically.
We propose a peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs.
We find that our approaches achieve higher accuracy and align better with human judgments.
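A simplified sketch of the peer-vote aggregation idea follows; the actual PR algorithm also re-weights reviewers by their own standing, which this illustration omits, and the data layout is an assumption.
```python
# Hedged sketch: aggregate every peer model's pairwise votes into a per-answer
# win rate, optionally weighting each reviewer.
from collections import defaultdict


def peer_win_rates(votes, reviewer_weights=None):
    """votes: list of (reviewer, winner, loser) tuples from pairwise judgments."""
    reviewer_weights = reviewer_weights or {}
    wins, totals = defaultdict(float), defaultdict(float)
    for reviewer, winner, loser in votes:
        w = reviewer_weights.get(reviewer, 1.0)
        wins[winner] += w
        totals[winner] += w
        totals[loser] += w
    return {model: wins[model] / totals[model] for model in totals}
```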
arXiv Detail & Related papers (2023-07-06T04:05:44Z)
- Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
When asked, large language models (LLMs) claim that they can assist with relevance judgments.
However, it is not yet clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.