Don't Use LLMs to Make Relevance Judgments
- URL: http://arxiv.org/abs/2409.15133v1
- Date: Mon, 23 Sep 2024 15:38:12 GMT
- Title: Don't Use LLMs to Make Relevance Judgments
- Authors: Ian Soboroff
- Abstract summary: The recent advent of large language models that produce astoundingly human-like flowing text output in response to a natural language prompt has inspired IR researchers to wonder how those models might be used in the relevance judgment collection process.
At the ACM SIGIR 2024 conference, a workshop "LLM4Eval" provided a venue for this work, and featured a data challenge activity where participants reproduced TREC deep learning track judgments.
The bottom-line-up-front message is, don't use LLMs to create relevance judgments for TREC-style evaluations.
- Score: 5.678164657239931
- License:
- Abstract: Making the relevance judgments for a TREC-style test collection can be complex and expensive. A typical TREC track usually involves a team of six contractors working for 2-4 weeks. Those contractors need to be trained and monitored. Software has to be written to support recording relevance judgments correctly and efficiently. The recent advent of large language models that produce astoundingly human-like flowing text output in response to a natural language prompt has inspired IR researchers to wonder how those models might be used in the relevance judgment collection process. At the ACM SIGIR 2024 conference, a workshop ``LLM4Eval'' provided a venue for this work, and featured a data challenge activity where participants reproduced TREC deep learning track judgments, as was done by Thomas et al (arXiv:2408.08896, arXiv:2309.10621). I was asked to give a keynote at the workshop, and this paper presents that keynote in article form. The bottom-line-up-front message is, don't use LLMs to create relevance judgments for TREC-style evaluations.
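As an illustration of the kind of pipeline the paper argues against, here is a minimal sketch of asking an LLM for a TREC-style graded relevance judgment, assuming the OpenAI Python client; the prompt wording, grading scale, and the judge_relevance helper are illustrative only and are not the workshop's or Thomas et al.'s actual setup.

```python
# Hypothetical sketch of LLM-based relevance judging (the practice this paper
# cautions against for TREC-style evaluations). Prompt wording and grading
# scale are illustrative, not the LLM4Eval challenge's actual prompts.
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()

PROMPT = """You are a relevance assessor for an information retrieval evaluation.
Topic: {topic}
Passage: {passage}
On a scale of 0 (not relevant) to 3 (perfectly relevant), how relevant is the
passage to the topic? Answer with a single digit."""

def judge_relevance(topic: str, passage: str) -> int:
    """Ask the model for a graded relevance label; returns 0 if the reply is unparseable."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT.format(topic=topic, passage=passage)}],
        temperature=0,
    )
    text = resp.choices[0].message.content.strip()
    return int(text[0]) if text[:1].isdigit() else 0
```

Even where such a pipeline reproduces existing qrels reasonably well, the paper's bottom-line message is that it should not replace human assessors for TREC-style evaluations.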
Related papers
- LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help? [18.663118865354427]
Test collections are information retrieval tools that allow researchers to quickly and easily evaluate ranking algorithms.
We propose LLM-Assisted Relevance Assessments (LARA) to balance manual annotations with LLM annotations.
arXiv Detail & Related papers (2024-11-11T11:17:35Z)
- Rewriting Conversational Utterances with Instructed Large Language Models [9.38751103209178]
Large language models (LLMs) can achieve state-of-the-art performance on many NLP tasks.
We study which prompts provide the most informative utterances that lead to the best retrieval performance.
The results show that rewriting conversational utterances with instructed LLMs achieves significant improvements of up to 25.2% in MRR, 31.7% in Precision@1, 27% in NDCG@3, and 11.5% in Recall@500 over state-of-the-art techniques.
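For readers unfamiliar with the reported measures, the sketch below shows how MRR, Precision@1, NDCG@3, and Recall@500 can be computed for a single query with binary relevance labels; it is a generic illustration, not the paper's evaluation code.

```python
import math

def retrieval_metrics(ranking: list[str], relevant: set[str], k: int = 3) -> dict:
    """Generic MRR, Precision@1, NDCG@k, and Recall@500 for one query with
    binary relevance; illustrative, not the paper's actual evaluation script."""
    # Reciprocal rank of the first relevant document (0 if none is retrieved).
    rr = next((1.0 / (i + 1) for i, d in enumerate(ranking) if d in relevant), 0.0)
    p_at_1 = 1.0 if ranking[:1] and ranking[0] in relevant else 0.0
    # Binary-gain DCG@k normalized by the ideal ordering.
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranking[:k]) if d in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    recall_500 = len(set(ranking[:500]) & relevant) / len(relevant) if relevant else 0.0
    return {"MRR": rr, "P@1": p_at_1, f"NDCG@{k}": ndcg, "R@500": recall_500}
```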
arXiv Detail & Related papers (2024-10-10T10:30:28Z)
- Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models [5.0490573482829335]
Large Language Models (LLMs) have been revolutionizing a myriad of natural language processing tasks with their diverse zero-shot capabilities.
This paper investigates the use of a pre-filtering step before passage re-ranking in information retrieval (IR).
Our experiments show that this pre-filtering then allows the LLM to perform significantly better at the re-ranking task.
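A minimal sketch of the pre-filter-then-re-rank idea, assuming a hypothetical llm_score function that maps a query-passage pair to a relevance score in [0, 1]; the paper's actual filtering criterion and prompts may differ.

```python
from typing import Callable

def prefilter_then_rerank(
    query: str,
    passages: list[str],
    llm_score: Callable[[str, str], float],  # hypothetical LLM relevance scorer in [0, 1]
    threshold: float = 0.3,
) -> list[str]:
    """Drop passages the scorer considers clearly irrelevant, then re-rank the rest.
    Illustrative only; not the paper's actual filtering criterion."""
    kept = [p for p in passages if llm_score(query, p) >= threshold]
    # Re-rank only the surviving passages, by the same (or a stronger) scorer.
    return sorted(kept, key=lambda p: llm_score(query, p), reverse=True)
```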
arXiv Detail & Related papers (2024-06-26T20:12:24Z)
- LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on the topic of LLMs assisting NLP researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z) - UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor [51.20527342770299]
UMBRELA is an open-source toolkit that reproduces the results of Thomas et al. using OpenAI's GPT-4o model.
Our toolkit is designed to be easy to study and can be integrated into existing multi-stage retrieval and evaluation pipelines.
UMBRELA will be used in the TREC 2024 RAG Track to aid in relevance assessments.
arXiv Detail & Related papers (2024-06-10T17:58:29Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language
Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally do not correlate well with human judgment.
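Such correlation claims are typically checked with a rank correlation between metric scores and human ratings; below is a generic sketch using SciPy's Kendall's tau, not L-Eval's own evaluation code.

```python
from scipy.stats import kendalltau  # assumes SciPy is installed

def metric_human_agreement(metric_scores: list[float], human_scores: list[float]) -> float:
    """Kendall's tau between an automatic metric (e.g. an n-gram overlap score)
    and human judgments over the same set of model outputs."""
    tau, _p_value = kendalltau(metric_scores, human_scores)
    return tau
```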
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
- Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks [12.723777984461693]
Large language models (LLMs) are remarkable data annotators.
Crowdsourcing, an important, inexpensive way to obtain human annotations, may itself be impacted by LLMs.
We estimate that 33-46% of crowd workers used LLMs when completing a task.
arXiv Detail & Related papers (2023-06-13T16:46:24Z)
- Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)
- Self-Refine: Iterative Refinement with Self-Feedback [62.78755306241981]
Self-Refine is an approach for improving initial outputs from large language models (LLMs) through iterative feedback and refinement.
We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs.
Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
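A compact sketch of the generate-critique-revise loop the summary describes, with hypothetical generate and feedback callables standing in for LLM calls; this is not the authors' released implementation.

```python
from typing import Callable

def self_refine(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: LLM call that drafts or revises an answer
    feedback: Callable[[str, str], str],  # hypothetical: LLM call that critiques the current draft
    max_rounds: int = 3,
) -> str:
    """Iteratively refine an initial output using the model's own feedback,
    following the generate -> critique -> revise loop the summary describes."""
    output = generate(prompt)
    for _ in range(max_rounds):
        critique = feedback(prompt, output)
        if "no further changes" in critique.lower():  # illustrative stopping rule
            break
        output = generate(
            f"{prompt}\n\nPrevious answer:\n{output}\n\nFeedback:\n{critique}\n\nRevise the answer."
        )
    return output
```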
arXiv Detail & Related papers (2023-03-30T18:30:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.