Perspectives on Large Language Models for Relevance Judgment
- URL: http://arxiv.org/abs/2304.09161v2
- Date: Sat, 18 Nov 2023 18:16:41 GMT
- Title: Perspectives on Large Language Models for Relevance Judgment
- Authors: Guglielmo Faggioli, Laura Dietz, Charles Clarke, Gianluca Demartini,
Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin
Potthast, Benno Stein, Henning Wachsmuth
- Abstract summary: Large language models (LLMs) claim that they can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
- Score: 56.935731584323996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When asked, large language models (LLMs) like ChatGPT claim that they can
assist with relevance judgments, but it is not clear whether automated judgments
can reliably be used in evaluations of retrieval systems. In this perspectives
paper, we discuss possible ways for LLMs to support relevance judgments along
with concerns and issues that arise. We devise a human–machine collaboration
spectrum that allows us to categorize different relevance judgment strategies,
based on how much humans rely on machines. For the extreme point of "fully
automated judgments", we further include a pilot experiment on whether
LLM-based relevance judgments correlate with judgments from trained human
assessors. We conclude the paper by providing opposing perspectives for and
against the use of LLMs for automatic relevance judgments, and a compromise
perspective, informed by our analyses of the literature, our preliminary
experimental evidence, and our experience as IR researchers.
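To make the "fully automated judgments" end of the spectrum concrete, the sketch below shows one way an LLM could label query-document pairs and how agreement with trained human assessors could then be measured with Cohen's kappa. The prompt, the three-level label scale, and the `call_llm` placeholder are illustrative assumptions, not the paper's actual pilot protocol.

```python
# Minimal sketch of fully automated relevance judgments plus an agreement check
# against human assessors. `call_llm` is a placeholder for any chat-completion
# client; the prompt and label scale are illustrative assumptions.
from sklearn.metrics import cohen_kappa_score

PROMPT = (
    "You are a relevance assessor. Given a query and a document, answer with a\n"
    "single digit: 0 = not relevant, 1 = partially relevant, 2 = highly relevant.\n\n"
    "Query: {query}\nDocument: {doc}\nLabel:"
)

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its raw text reply."""
    raise NotImplementedError("wire up your chat-completion client here")

def llm_judge(query: str, doc: str) -> int:
    """Parse the first 0/1/2 digit from the model's reply as the relevance label."""
    reply = call_llm(PROMPT.format(query=query, doc=doc))
    digits = [c for c in reply if c in "012"]
    return int(digits[0]) if digits else 0  # default to non-relevant if unparseable

def agreement(pairs, human_labels):
    """Cohen's kappa between LLM and human labels on the same (query, doc) pairs."""
    llm_labels = [llm_judge(q, d) for q, d in pairs]
    return cohen_kappa_score(human_labels, llm_labels)
```

A kappa near the inter-assessor agreement observed among trained humans would support the "fully automated" end of the spectrum; a much lower kappa would argue for the assisted or manual regions instead.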
Related papers
- On the Statistical Significance with Relevance Assessments of Large Language Models [2.9180406633632523]
We use large language models (LLMs) to label document relevance when building new retrieval test collections.
Our results show that LLM judgements detect most of the significant differences while maintaining acceptable numbers of false positives.
Our work represents a step forward in the evaluation of statistical testing results provided by LLM judgements.
arXiv Detail & Related papers (2024-11-20T11:19:35Z) - A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness; a sketch of this kind of run-level comparison appears after this list.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z) - JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z) - Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates [0.0]
We propose a framework that interprets large language models (LLMs) as advocates within an ensemble of interacting agents.
This approach offers a more dynamic and comprehensive evaluation process compared to traditional human-based assessments or automated metrics.
arXiv Detail & Related papers (2024-10-07T00:22:07Z) - From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks [11.01213914485374]
We study large language models (LLMs) as judges on mathematical reasoning tasks.
Our analysis uncovers a strong correlation between judgment performance and the candidate models' task performance.
We show that it is possible to use statistics, such as the task performances of the individual models, to predict judgment performance.
arXiv Detail & Related papers (2024-09-06T10:09:41Z) - Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations.
Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs.
We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
arXiv Detail & Related papers (2024-08-17T16:01:45Z) - Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities in assessing the quality of generated natural language.
LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.
We introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts.
arXiv Detail & Related papers (2024-03-25T17:11:28Z) - Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z) - Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z) - AI for human assessment: What do professional assessors need? [33.88509725285237]
This case study aims to help professional assessors make decisions in human assessment, in which they conduct interviews with assessees and evaluate their suitability for certain job roles.
A computational system that can extract nonverbal cues of assessees would benefit assessors by supporting their decision making.
We developed such a system based on an unsupervised anomaly detection algorithm using multimodal behavioral features such as facial keypoints, pose, head pose, and gaze.
arXiv Detail & Related papers (2022-04-18T03:35:37Z)
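For the run-level comparison mentioned in the TREC 2024 RAG Track entry above, the sketch below illustrates one common way to check whether LLM-based judgments capture run-level effectiveness: compute a metric per run under human judgments and under LLM judgments, then correlate the two system orderings with Kendall's tau. The per-run scores are assumed to come from an external evaluator (e.g., trec_eval or ir_measures), and the numbers shown are illustrative, not results from that paper.

```python
# Minimal sketch of run-level agreement between human and LLM relevance judgments:
# rank systems by a metric computed against each set of qrels and compare the
# orderings with Kendall's tau. Score dictionaries and values are illustrative.
from scipy.stats import kendalltau

def run_level_agreement(scores_human: dict, scores_llm: dict) -> float:
    """Kendall's tau between system orderings under human vs LLM judgments."""
    systems = sorted(scores_human)             # assume the same runs in both dicts
    human = [scores_human[s] for s in systems]
    llm = [scores_llm[s] for s in systems]
    tau, _p = kendalltau(human, llm)
    return tau

# Illustrative usage: nDCG@10 per run under each set of judgments.
human_ndcg = {"runA": 0.52, "runB": 0.47, "runC": 0.61}
llm_ndcg = {"runA": 0.55, "runB": 0.44, "runC": 0.63}
print(run_level_agreement(human_ndcg, llm_ndcg))  # 1.0 -> identical run ordering
```

A tau close to 1.0 means the LLM-based judgments rank systems the same way the human judgments do, which is the sense in which automatic judgments can "replace" manual ones for system comparison even if individual labels differ.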