UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor
- URL: http://arxiv.org/abs/2406.06519v1
- Date: Mon, 10 Jun 2024 17:58:29 GMT
- Title: UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor
- Authors: Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Nick Craswell, Jimmy Lin
- Abstract summary: UMBRELA is an open-source toolkit that reproduces the results of Thomas et al. using OpenAI's GPT-4o model.
Our toolkit is designed to be easily extensible and can be integrated into existing multi-stage retrieval and evaluation pipelines.
UMBRELA will be used in the TREC 2024 RAG Track to aid in relevance assessments.
- Abstract: Copious amounts of relevance judgments are necessary for the effective training and accurate evaluation of retrieval systems. Conventionally, these judgments are made by human assessors, rendering this process expensive and laborious. A recent study by Thomas et al. from Microsoft Bing suggested that large language models (LLMs) can accurately perform the relevance assessment task and provide human-quality judgments, but unfortunately their study did not yield any reusable software artifacts. Our work presents UMBRELA (a recursive acronym that stands for UMbrela is the Bing RELevance Assessor), an open-source toolkit that reproduces the results of Thomas et al. using OpenAI's GPT-4o model and adds more nuance to the original paper. Across Deep Learning Tracks from TREC 2019 to 2023, we find that LLM-derived relevance judgments correlate highly with rankings generated by effective multi-stage retrieval systems. Our toolkit is designed to be easily extensible and can be integrated into existing multi-stage retrieval and evaluation pipelines, offering researchers a valuable resource for studying retrieval evaluation methodologies. UMBRELA will be used in the TREC 2024 RAG Track to aid in relevance assessments, and we envision our toolkit becoming a foundation for further innovation in the field. UMBRELA is available at https://github.com/castorini/umbrela.
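As a rough illustration of the pattern the abstract describes, the sketch below prompts GPT-4o for a graded relevance judgment through the OpenAI Python client. The prompt wording is a simplified stand-in, not the toolkit's actual prompt (which follows Thomas et al.); see the GitHub repository above for the real implementation.

```python
# A minimal sketch of LLM-based relevance assessment; the prompt wording is
# illustrative, not UMBRELA's actual prompt (see github.com/castorini/umbrela).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Judge how well the passage answers the query on a 0-3 scale:
0 = irrelevant, 1 = related but does not answer, 2 = answers with extra or
unclear material, 3 = dedicated, exact answer.

Query: {query}
Passage: {passage}

Respond with the integer grade only."""

def judge(query: str, passage: str) -> int:
    """Ask GPT-4o for a graded relevance judgment on one query-passage pair."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep judgments as deterministic as the API allows
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query, passage=passage)}],
    )
    return int(response.choices[0].message.content.strip())

print(judge("who founded TREC", "TREC was started in 1992 by NIST."))
```

In a full pipeline, judge() would be called once per pooled (query, passage) pair and the resulting grades written out as qrels for trec_eval.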
Related papers
- A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
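The run-level claim is concrete enough to illustrate: score every run twice, once against manual qrels and once against UMBRELA qrels, then check how closely the two system orderings agree. A minimal sketch with made-up effectiveness numbers, using Kendall's tau as the agreement measure:

```python
# Compare system rankings under manual vs. LLM-derived judgments.
from scipy.stats import kendalltau

# Hypothetical nDCG@10 per system under each judgment source (made-up numbers).
manual = {"runA": 0.52, "runB": 0.48, "runC": 0.61, "runD": 0.40}
umbrela = {"runA": 0.55, "runB": 0.47, "runC": 0.63, "runD": 0.42}

systems = sorted(manual)
tau, p = kendalltau([manual[s] for s in systems],
                    [umbrela[s] for s in systems])
print(f"Kendall's tau = {tau:.3f}")  # 1.000 here: both sources order the runs identically
```

A tau near 1.0 means the LLM-derived judgments order systems almost exactly as the manual judgments do, which is the sense in which they can replace manual assessment at the run level.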
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
- Don't Use LLMs to Make Relevance Judgments
The recent advent of large language models that produce astoundingly human-like flowing text output in response to a natural language prompt has inspired IR researchers to wonder how those models might be used in the relevance judgment collection process.
At the ACM SIGIR 2024 conference, the "LLM4Eval" workshop provided a venue for this work and featured a data challenge activity where participants reproduced TREC Deep Learning Track judgments.
The bottom-line-up-front message is, don't use LLMs to create relevance judgments for TREC-style evaluations.
arXiv Detail & Related papers (2024-09-23T15:38:12Z)
- Benchmarking Educational Program Repair
Large language models (LLMs) can be used to generate learning resources, improve error messages, and provide feedback on code.
There is a pressing need for standardization and benchmarks that facilitate the equitable comparison of competing approaches.
In this article, we propose a novel educational program repair benchmark.
arXiv Detail & Related papers (2024-05-08T18:23:59Z)
- LitLLM: A Toolkit for Scientific Literature Review
The toolkit operates on Retrieval Augmented Generation (RAG) principles.
First, the system initiates a web search to retrieve relevant papers.
Second, the system re-ranks the retrieved papers based on the user-provided abstract.
Third, the related work section is generated based on the re-ranked results and the abstract (a skeleton of this pipeline is sketched below).
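A self-contained skeleton of the three-stage pipeline just described, with the search stage stubbed out and a toy word-overlap score standing in for the real re-ranker; all function names are hypothetical, not LitLLM's actual API:

```python
def search(query: str) -> list[dict]:
    # Stage 1: the real system would call a scholarly web search API;
    # here we return a fixed stand-in list (query is unused in the stub).
    return [
        {"title": "Paper A", "abstract": "large language models as relevance assessors"},
        {"title": "Paper B", "abstract": "benchmarks for educational program repair"},
    ]

def rerank(papers: list[dict], user_abstract: str) -> list[dict]:
    # Stage 2: order retrieved papers by similarity to the user's abstract
    # (crude word overlap here, purely for illustration).
    words = set(user_abstract.lower().split())
    def overlap(p: dict) -> int:
        return len(words & set(p["abstract"].lower().split()))
    return sorted(papers, key=overlap, reverse=True)

def related_work_prompt(papers: list[dict], user_abstract: str) -> str:
    # Stage 3: assemble the generation prompt from the re-ranked results.
    refs = "\n".join(f"- {p['title']}: {p['abstract']}" for p in papers)
    return (f"Write a related-work section for this abstract:\n{user_abstract}\n\n"
            f"Citing these papers:\n{refs}")

abstract = "We evaluate large language models as relevance assessors."
ranked = rerank(search("LLM relevance assessment"), abstract)
print(related_work_prompt(ranked, abstract))
```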
arXiv Detail & Related papers (2024-02-02T02:41:28Z)
- Evaluating Large Language Models at Evaluating Instruction Following
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator to discern instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- Synergistic Interplay between Search and Large Language Models for Information Retrieval
InteR allows retrieval models (RMs) to expand the knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
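The expansion half of this interplay is straightforward to sketch: ask an LLM to draft background knowledge for the query and append it before retrieval. The snippet below is a one-shot simplification of InteR's iterative loop, with an illustrative prompt and model choice:

```python
# A one-shot simplification of LLM-based query expansion; InteR's actual
# search/LLM interplay is iterative and more involved.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_query(query: str) -> str:
    """Append LLM-drafted background knowledge to the query before retrieval."""
    knowledge = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice, not the paper's setup
        messages=[{"role": "user",
                   "content": f"Write one short paragraph of background knowledge about: {query}"}],
    ).choices[0].message.content
    return f"{query} {knowledge}"

# The expanded string would then be fed to a retrieval model such as BM25.
print(expand_query("UMBRELA relevance assessor"))
```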
arXiv Detail & Related papers (2023-05-12T11:58:15Z)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks.
This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR).
To address concerns about data contamination of LLMs, we collect a new test set called NovelEval.
To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
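A minimal sketch of listwise LLM re-ranking in the spirit of this paper: present candidate passages with indices and ask the model for a permutation, best first. The prompt wording and the deliberately naive reply parsing are illustrative only:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_rerank(query: str, passages: list[str]) -> list[int]:
    """Ask the model for a permutation of passage indices, best first."""
    listing = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    prompt = (f"Rank the passages below by relevance to the query.\n"
              f"Query: {query}\n{listing}\n"
              f"Answer with indices only, best first, e.g. 2 > 0 > 1.")
    reply = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the paper studied ChatGPT/GPT-4
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # Fragile toy parsing: "1 > 0" -> [1, 0]
    return [int(tok) for tok in reply.replace(">", " ").split()]

order = llm_rerank("capital of France",
                   ["Berlin is the capital of Germany.",
                    "Paris is the capital of France."])
print(order)  # expected: [1, 0]
```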
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
- Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study
We study a scalable pre-trained retrieval-augmented LM (i.e., RETRO) compared with standard GPT and retrieval-augmented GPT.
Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models.
arXiv Detail & Related papers (2023-04-13T18:04:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.