Don't Use LLMs to Make Relevance Judgments
- URL: http://arxiv.org/abs/2409.15133v1
- Date: Mon, 23 Sep 2024 15:38:12 GMT
- Title: Don't Use LLMs to Make Relevance Judgments
- Authors: Ian Soboroff
- Abstract summary: The recent advent of large language models that produce astoundingly human-like flowing text output in response to a natural language prompt has inspired IR researchers to wonder how those models might be used in the relevance judgment collection process.
At the ACM SIGIR 2024 conference, a workshop "LLM4Eval" provided a venue for this work, and featured a data challenge activity where participants reproduced TREC deep learning track judgments.
The bottom-line-up-front message is, don't use LLMs to create relevance judgments for TREC-style evaluations.
- Score: 5.678164657239931
- License:
- Abstract: Making the relevance judgments for a TREC-style test collection can be complex and expensive. A typical TREC track usually involves a team of six contractors working for 2-4 weeks. Those contractors need to be trained and monitored. Software has to be written to support recording relevance judgments correctly and efficiently. The recent advent of large language models that produce astoundingly human-like flowing text output in response to a natural language prompt has inspired IR researchers to wonder how those models might be used in the relevance judgment collection process. At the ACM SIGIR 2024 conference, a workshop ``LLM4Eval'' provided a venue for this work, and featured a data challenge activity where participants reproduced TREC deep learning track judgments, as was done by Thomas et al (arXiv:2408.08896, arXiv:2309.10621). I was asked to give a keynote at the workshop, and this paper presents that keynote in article form. The bottom-line-up-front message is, don't use LLMs to create relevance judgments for TREC-style evaluations.
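As an illustration of the kind of pipeline the paper argues against, here is a minimal sketch of asking an LLM for a TREC-style graded relevance judgment, assuming the OpenAI Python client; the prompt wording, grading scale, and the judge_relevance helper are illustrative only and are not the workshop's or Thomas et al.'s actual setup.

```python
# Hypothetical sketch of LLM-based relevance judging (the practice this paper
# cautions against for TREC-style evaluations). Prompt wording and grading
# scale are illustrative, not the LLM4Eval challenge's actual prompts.
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()

PROMPT = """You are a relevance assessor for an information retrieval evaluation.
Topic: {topic}
Passage: {passage}
On a scale of 0 (not relevant) to 3 (perfectly relevant), how relevant is the
passage to the topic? Answer with a single digit."""

def judge_relevance(topic: str, passage: str) -> int:
    """Ask the model for a graded relevance label; returns 0 if the reply is unparseable."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT.format(topic=topic, passage=passage)}],
        temperature=0,
    )
    text = resp.choices[0].message.content.strip()
    return int(text[0]) if text[:1].isdigit() else 0
```

Even where such a pipeline reproduces existing qrels reasonably well, the paper's bottom-line message is that it should not replace human assessors for TREC-style evaluations.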
Related papers
- LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help? [18.663118865354427]
Test collections are information retrieval tools that allow researchers to quickly and easily evaluate ranking algorithms.
We propose LLM-Assisted Relevance Assessments (LARA) to balance manual annotations with LLM annotations.
arXiv Detail & Related papers (2024-11-11T11:17:35Z)
- Rewriting Conversational Utterances with Instructed Large Language Models [9.38751103209178]
Large language models (LLMs) can achieve state-of-the-art performance on many NLP tasks.
We study which prompts provide the most informative utterances that lead to the best retrieval performance.
The results show that rewriting conversational utterances with instructed LLMs achieves significant improvements of up to 25.2% in MRR, 31.7% in Precision@1, 27% in NDCG@3, and 11.5% in Recall@500 over state-of-the-art techniques.
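For readers unfamiliar with the reported measures, the sketch below shows how MRR, Precision@1, NDCG@3, and Recall@500 can be computed for a single query with binary relevance labels; it is a generic illustration, not the paper's evaluation code.

```python
import math

def retrieval_metrics(ranking: list[str], relevant: set[str], k: int = 3) -> dict:
    """Generic MRR, Precision@1, NDCG@k, and Recall@500 for one query with
    binary relevance; illustrative, not the paper's actual evaluation script."""
    # Reciprocal rank of the first relevant document (0 if none is retrieved).
    rr = next((1.0 / (i + 1) for i, d in enumerate(ranking) if d in relevant), 0.0)
    p_at_1 = 1.0 if ranking[:1] and ranking[0] in relevant else 0.0
    # Binary-gain DCG@k normalized by the ideal ordering.
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranking[:k]) if d in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    recall_500 = len(set(ranking[:500]) & relevant) / len(relevant) if relevant else 0.0
    return {"MRR": rr, "P@1": p_at_1, f"NDCG@{k}": ndcg, "R@500": recall_500}
```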
arXiv Detail & Related papers (2024-10-10T10:30:28Z)
- Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models [5.0490573482829335]
Large Language Models (LLMs) have been revolutionizing a myriad of natural language processing tasks with their diverse zero-shot capabilities.
This paper investigates the use of a pre-filtering step before passage re-ranking in information retrieval (IR).
Our experiments show that this pre-filtering then allows the LLM to perform significantly better at the re-ranking task.
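A minimal sketch of the pre-filter-then-re-rank idea, assuming a hypothetical llm_score function that maps a query-passage pair to a relevance score in [0, 1]; the paper's actual filtering criterion and prompts may differ.

```python
from typing import Callable

def prefilter_then_rerank(
    query: str,
    passages: list[str],
    llm_score: Callable[[str, str], float],  # hypothetical LLM relevance scorer in [0, 1]
    threshold: float = 0.3,
) -> list[str]:
    """Drop passages the scorer considers clearly irrelevant, then re-rank the rest.
    Illustrative only; not the paper's actual filtering criterion."""
    kept = [p for p in passages if llm_score(query, p) >= threshold]
    # Re-rank only the surviving passages, by the same (or a stronger) scorer.
    return sorted(kept, key=lambda p: llm_score(query, p), reverse=True)
```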
arXiv Detail & Related papers (2024-06-26T20:12:24Z)
- LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on the topic of LLMs assisting NLP researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z) - UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor [51.20527342770299]
UMBRELA is an open-source toolkit that reproduces the results of Thomas et al. using OpenAI's GPT-4o model.
Our toolkit is designed to be easy to study and can be integrated into existing multi-stage retrieval and evaluation pipelines.
UMBRELA will be used in the TREC 2024 RAG Track to aid in relevance assessments.
arXiv Detail & Related papers (2024-06-10T17:58:29Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language
Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally do not correlate well with human judgment.
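Such correlation claims are typically checked with a rank correlation between metric scores and human ratings; below is a generic sketch using SciPy's Kendall's tau, not L-Eval's own evaluation code.

```python
from scipy.stats import kendalltau  # assumes SciPy is installed

def metric_human_agreement(metric_scores: list[float], human_scores: list[float]) -> float:
    """Kendall's tau between an automatic metric (e.g. an n-gram overlap score)
    and human judgments over the same set of model outputs."""
    tau, _p_value = kendalltau(metric_scores, human_scores)
    return tau
```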
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
- Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks [12.723777984461693]
Large language models (LLMs) are remarkable data annotators.
Crowdsourcing, an important, inexpensive way to obtain human annotations, may itself be impacted by LLMs.
We estimate that 33-46% of crowd workers used LLMs when completing a task.
arXiv Detail & Related papers (2023-06-13T16:46:24Z)
- Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)
- Self-Refine: Iterative Refinement with Self-Feedback [62.78755306241981]
Self-Refine is an approach for improving initial outputs from large language models (LLMs) through iterative feedback and refinement.
We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs.
Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
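A compact sketch of the generate-critique-revise loop the summary describes, with hypothetical generate and feedback callables standing in for LLM calls; this is not the authors' released implementation.

```python
from typing import Callable

def self_refine(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: LLM call that drafts or revises an answer
    feedback: Callable[[str, str], str],  # hypothetical: LLM call that critiques the current draft
    max_rounds: int = 3,
) -> str:
    """Iteratively refine an initial output using the model's own feedback,
    following the generate -> critique -> revise loop the summary describes."""
    output = generate(prompt)
    for _ in range(max_rounds):
        critique = feedback(prompt, output)
        if "no further changes" in critique.lower():  # illustrative stopping rule
            break
        output = generate(
            f"{prompt}\n\nPrevious answer:\n{output}\n\nFeedback:\n{critique}\n\nRevise the answer."
        )
    return output
```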
arXiv Detail & Related papers (2023-03-30T18:30:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.