Can We Use Large Language Models to Fill Relevance Judgment Holes?
- URL: http://arxiv.org/abs/2405.05600v1
- Date: Thu, 9 May 2024 07:39:19 GMT
- Title: Can We Use Large Language Models to Fill Relevance Judgment Holes?
- Authors: Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, Mohammad Aliannejadi
- Abstract summary: We take initial steps towards extending existing test collections by employing Large Language Models (LLMs) to fill the holes.
We find substantially lower correlations when human plus automatic judgments are used.
- Score: 9.208308067952155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Incomplete relevance judgments limit the re-usability of test collections. When new systems are compared against the previous systems used to build the pool of judged documents, they often do so at a disadvantage due to the ``holes'' in the test collection (i.e., pockets of un-assessed documents returned by the new system). In this paper, we take initial steps towards extending existing test collections by employing Large Language Models (LLMs) to fill the holes, leveraging existing human judgments to ground the method. We explore this problem in the context of Conversational Search using TREC iKAT, where information needs are highly dynamic and the responses (and the results retrieved) are much more varied, leaving bigger holes. While previous work has shown that automatic judgments from LLMs result in highly correlated rankings, we find substantially lower correlations when human plus automatic judgments are used (regardless of the LLM and of one/two/few-shot or fine-tuned prompting). We further find that, depending on the LLM employed, new runs will be highly favored (or penalized), and this effect is magnified proportionally to the size of the holes. Instead, one should generate the LLM annotations over the whole document pool to achieve rankings more consistent with human-generated labels. Future work is required on prompt engineering and fine-tuning of LLMs so that they reflect and represent the human annotations, grounding and aligning the models to make them more fit for purpose.
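To make the setup concrete, the sketch below fills holes for a single run: any retrieved document absent from the human qrels gets an LLM label. This is a minimal illustration under assumed inputs (qrels, run, queries, and corpus as plain dicts; `judge` as a hypothetical LLM call), not the authors' pipeline; per the finding above, judging the whole pool rather than only the holes yields more consistent rankings.

```python
# Minimal sketch of hole-filling: any document a run retrieves that is
# missing from the human qrels is a "hole", and an LLM judge supplies its
# label. `judge` is a hypothetical callable (query_text, doc_text) -> int
# standing in for whatever prompted or fine-tuned model is used.

def fill_holes(qrels, run, queries, corpus, judge):
    """Extend human qrels with LLM labels for unjudged (hole) documents."""
    extended = {qid: dict(labels) for qid, labels in qrels.items()}
    for qid, ranked_doc_ids in run.items():
        labels = extended.setdefault(qid, {})
        for doc_id in ranked_doc_ids:
            if doc_id not in labels:  # an unassessed document: a "hole"
                labels[doc_id] = judge(queries[qid], corpus[doc_id])
    return extended
```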
Related papers
- On the Statistical Significance with Relevance Assessments of Large Language Models [2.9180406633632523]
We use Large Language Models for labelling relevance of documents for building new retrieval test collections.
Our results show that LLM judgements detect most of the significant differences while maintaining acceptable numbers of false positives.
Our work represents a step forward in the evaluation of statistical testing results provided by LLM judgements.
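A minimal sketch of this kind of check, assuming per-query effectiveness scores aligned across two systems; the paired test and the notion of a false positive are the standard ones, not necessarily the paper's exact protocol.

```python
# Do LLM-based judgments preserve the significant differences that human
# judgments detect? Compare two systems' per-query scores under each label
# set. A "false positive" is a system pair that LLM labels call
# significantly different while human labels do not.
from scipy.stats import ttest_rel

def is_significant(scores_a, scores_b, alpha=0.05):
    """Paired significance test over aligned per-query scores."""
    return ttest_rel(scores_a, scores_b).pvalue < alpha

def llm_matches_human(human_a, human_b, llm_a, llm_b):
    """Do LLM-based labels reproduce the human verdict for this pair?"""
    return is_significant(human_a, human_b) == is_significant(llm_a, llm_b)
```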
arXiv Detail & Related papers (2024-11-20T11:19:35Z)
- Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach.
This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets.
We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
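The sliding-window strategy can be sketched in a few lines; `rank_window` is a hypothetical LLM call that returns a window's indices in preferred order, and the window and step sizes are illustrative.

```python
# Sketch of sliding-window listwise reranking: rerank a window of
# candidates, then slide toward the top of the list with overlap so strong
# documents can bubble up. `rank_window(query, docs)` is a hypothetical
# LLM call returning a permutation of the window's indices.

def sliding_window_rerank(query, docs, rank_window, window=20, step=10):
    docs = list(docs)
    start = max(len(docs) - window, 0)  # begin at the bottom of the ranking
    while True:
        chunk = docs[start:start + window]
        order = rank_window(query, chunk)
        docs[start:start + window] = [chunk[i] for i in order]
        if start == 0:
            break
        start = max(start - step, 0)
    return docs
```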
arXiv Detail & Related papers (2024-11-07T10:31:31Z)
- Zero-Shot Dense Retrieval with Embeddings from Relevance Feedback [17.986392250269606]
We introduce Real Document Embeddings from Relevance Feedback (ReDE-RF).
Inspired by relevance feedback, ReDE-RF proposes to re-frame hypothetical document generation as a relevance estimation task.
Our experiments show that ReDE-RF consistently surpasses state-of-the-art zero-shot dense retrieval methods.
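A rough sketch of the reframing, assuming a generic dense retriever (`embed`, `search`) and a hypothetical LLM relevance check (`is_relevant`); the paper's exact estimation procedure may differ.

```python
# Sketch of the relevance-feedback reframing: retrieve a first pass with
# the raw query, let an LLM mark which hits look relevant, and average
# those real document embeddings into the search vector, instead of
# embedding an LLM-generated hypothetical document.
import numpy as np

def feedback_query_vector(query, embed, search, is_relevant, k=10):
    hits = search(embed(query), top_k=k)  # first-pass retrieval
    relevant = [doc for doc in hits if is_relevant(query, doc)]
    if not relevant:                      # no feedback signal: plain query
        return embed(query)
    return np.mean([embed(doc) for doc in relevant], axis=0)
```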
arXiv Detail & Related papers (2024-10-28T17:40:40Z)
- Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models [5.0490573482829335]
Large Language Models (LLMs) have been revolutionizing a myriad of natural language processing tasks with their diverse zero-shot capabilities.
This paper investigates the use of a pre-filtering step before passage re-ranking in information retrieval (IR).
Our experiments show that this pre-filtering then allows the LLM to perform significantly better at the re-ranking task.
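A minimal sketch of the two-stage idea, with `looks_relevant` and `rerank` as hypothetical LLM calls.

```python
# Pre-filter then re-rank: a cheap LLM yes/no pass removes clearly
# irrelevant passages so the (more expensive) re-ranker sees a smaller,
# cleaner candidate set.

def prefilter_then_rerank(query, passages, looks_relevant, rerank):
    kept = [p for p in passages if looks_relevant(query, p)]
    return rerank(query, kept or passages)  # fall back if all are dropped
```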
arXiv Detail & Related papers (2024-06-26T20:12:24Z)
- Efficient Document Ranking with Learnable Late Interactions [73.41976017860006]
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval.
To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings.
Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer.
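For orientation, the classic fixed late-interaction scorer (ColBERT-style MaxSim) looks as follows; the paper above replaces this fixed rule with a learnable lightweight scorer on top of the same dual-encoder token embeddings.

```python
# MaxSim late interaction: keep per-token embeddings on both sides and
# score with the sum of per-query-token maximum similarities.
# Q: (query_tokens, dim), D: (doc_tokens, dim), assumed L2-normalized.
import numpy as np

def maxsim_score(Q: np.ndarray, D: np.ndarray) -> float:
    sims = Q @ D.T                        # query-token x doc-token similarities
    return float(sims.max(axis=1).sum())  # best doc token per query token, summed
```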
arXiv Detail & Related papers (2024-06-25T22:50:48Z)
- WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs).
In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia.
We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
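A hypothetical sketch of how such QA scenarios can be posed; the template below is a stand-in, not the benchmark's actual prompt.

```python
# Ask the same question with a single passage and with two contradictory
# passages, then inspect whether the answer acknowledges the conflict.

def rag_prompt(question, passages):
    context = "\n\n".join(f"Passage {i}: {p}" for i, p in enumerate(passages, 1))
    return (f"{context}\n\nQuestion: {question}\n"
            "Answer using only the passages above; if they conflict, say so.")

# Single-passage vs. contradictory-passage scenarios:
# rag_prompt(q, [p1])  vs.  rag_prompt(q, [p1, p2_contradicting])
```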
arXiv Detail & Related papers (2024-06-19T20:13:42Z)
- CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation [76.31621715032558]
Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses.
We introduce CaLM, a novel verification framework.
Our framework empowers smaller LMs, which rely less on parametric memory, to validate the output of larger LMs.
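A rough sketch of the contrast idea under assumed callables (`large_lm`, `small_lm`, `agree`); it approximates the framework rather than reproducing the paper's procedure.

```python
# The small model re-answers the question from the cited documents alone;
# agreement with the large model's answer is taken as evidence the output
# is grounded, and disagreement triggers regeneration.

def verify_grounded(question, docs, large_lm, small_lm, agree, max_rounds=3):
    answer, cited = large_lm(question, docs)      # answer plus its citations
    for _ in range(max_rounds):
        check = small_lm(question, cited)         # small LM sees citations only
        if agree(answer, check):
            return answer                         # consistent: accept as grounded
        answer, cited = large_lm(question, docs)  # inconsistent: regenerate
    return answer                                 # give up after max_rounds
```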
arXiv Detail & Related papers (2024-06-08T06:04:55Z)
- List-aware Reranking-Truncation Joint Model for Search and Retrieval-augmented Generation [80.12531449946655]
We propose a Reranking-Truncation joint model (GenRT) that can perform the two tasks concurrently.
GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture.
Our method achieves SOTA performance on both reranking and truncation tasks for web search and retrieval-augmented LLMs.
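GenRT itself couples the two decisions inside one encoder-decoder; as a toy illustration of why reranking and truncation interact, one can pick the cutoff that maximizes a simple list utility over the reranked scores.

```python
# Illustrative truncation only, not the paper's architecture: after
# reranking, choose the cutoff k that maximizes a running utility (reward
# above a relevance threshold, penalty below it).

def best_cutoff(reranked_scores, threshold=0.5):
    best_k, best_utility, utility = 0, 0.0, 0.0
    for k, score in enumerate(reranked_scores, start=1):
        utility += score - threshold  # positive iff the doc looks relevant
        if utility > best_utility:
            best_k, best_utility = k, utility
    return best_k                     # the top-k worth keeping
```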
arXiv Detail & Related papers (2024-02-05T06:52:53Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
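A minimal sketch of the reliability check implied here: correlate LLM-assigned quality scores with human ratings over the same summaries.

```python
# Arrays are assumed aligned by summary; low or unstable correlations are
# what "not ready as human replacements" refers to.
from scipy.stats import kendalltau, spearmanr

def evaluator_agreement(llm_scores, human_scores):
    return {
        "spearman": spearmanr(llm_scores, human_scores).statistic,
        "kendall": kendalltau(llm_scores, human_scores).statistic,
    }
```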
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Automatic Evaluation of Attribution by Large Language Models [24.443271739599194]
We investigate the automatic evaluation of attribution given by large language models (LLMs).
We begin by defining different types of attribution errors, and then explore two approaches for automatic evaluation.
We manually curate a set of test examples covering 12 domains from a generative search engine, New Bing.
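A sketch of the prompting approach to this evaluation, using the three error-aware categories the paper defines; the template wording is a hypothetical stand-in.

```python
# Ask a model whether a cited reference supports a generated statement,
# classifying into the paper's three attribution categories.

LABELS = ("attributable", "extrapolatory", "contradictory")

def attribution_prompt(statement, reference):
    return (
        f"Claim: {statement}\nReference: {reference}\n"
        "Is the claim fully supported by the reference? "
        f"Answer with exactly one of: {', '.join(LABELS)}."
    )
```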
arXiv Detail & Related papers (2023-05-10T16:58:33Z)
- Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models [6.425088990363101]
We examine the relationship between fluency and attribution in Large Language Models prompted with retrieved evidence.
We show that larger models tend to do much better in both fluency and attribution.
We propose a recipe that could allow smaller models to both close the gap with larger models and preserve the benefits of top-k retrieval.
arXiv Detail & Related papers (2023-02-11T02:43:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.