Related papers: LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help?

LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help?

URL: http://arxiv.org/abs/2411.06877v1
Date: Mon, 11 Nov 2024 11:17:35 GMT
Title: LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help?
Authors: Rikiya Takehi, Ellen M. Voorhees, Tetsuya Sakai,
Abstract summary: Test collections are information retrieval tools that allow researchers to quickly and easily evaluate ranking algorithms. We propose textbfLLM-textbfAssisted textbfRelevance textbfAssessments (textbfLARA) to balance manual annotations with LLM annotations.
Score: 18.663118865354427
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Test collections are information retrieval tools that allow researchers to quickly and easily evaluate ranking algorithms. While test collections have become an integral part of IR research, the process of data creation involves significant efforts in manual annotations, which often makes it very expensive and time-consuming. Thus, the test collections could become small when the budget is limited, which may lead to unstable evaluations. As an alternative, recent studies have proposed the use of large language models (LLMs) to completely replace human assessors. However, while LLMs seem to somewhat correlate with human judgments, they are not perfect and often show bias. Moreover, even if a well-performing LLM or prompt is found on one dataset, there is no guarantee that it will perform similarly in practice, due to difference in tasks and data. Thus a complete replacement with LLMs is argued to be too risky and not fully trustable. Thus, in this paper, we propose \textbf{L}LM-\textbf{A}ssisted \textbf{R}elevance \textbf{A}ssessments (\textbf{LARA}), an effective method to balance manual annotations with LLM annotations, which helps to make a rich and reliable test collection. We use the LLM's predicted relevance probabilities in order to select the most profitable documents to manually annotate under a budget constraint. While solely relying on LLM's predicted probabilities to manually annotate performs fairly well, with theoretical reasoning, LARA guides the human annotation process even more effectively via online calibration learning. Then, using the calibration model learned from the limited manual annotations, LARA debiases the LLM predictions to annotate the remaining non-assessed data. Empirical evaluations on TREC-COVID and TREC-8 Ad Hoc datasets show that LARA outperforms the alternative solutions under almost any budget constraint.

Related papers

LLMs as Data Annotators: How Close Are We to Human Performance [47.61698665650761]
Manual annotation of data is labor-intensive, time-consuming, and costly. In-context learning (ICL) in which some examples related to the task are given in the prompt can lead to inefficiencies and suboptimal model performance. This paper presents experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task.
arXiv Detail & Related papers (2025-04-21T11:11:07Z)
Dynamic Uncertainty Ranking: Enhancing Retrieval-Augmented In-Context Learning for Long-Tail Knowledge in LLMs [50.29035873837]
Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training. Long-tail knowledge from specialized domains is often scarce and underrepresented, rarely appearing in the models' memorization. We propose a reinforcement learning-based dynamic uncertainty ranking method for ICL that accounts for the varying impact of each retrieved sample on LLM predictions.
arXiv Detail & Related papers (2024-10-31T03:42:17Z)
Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval [55.63711219190506]
Large language models (LLMs) often struggle with posing the right search queries. We introduce $underlineLe$arning to $underlineRe$trieve by $underlineT$rying (LeReT) LeReT can improve the absolute retrieval accuracy by up to 29% and the downstream generator evaluations by 17%.
arXiv Detail & Related papers (2024-10-30T17:02:54Z)
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation [24.954629877691623]
TICK (Targeted Instruct-evaluation with ChecKlists) is a fully automated, interpretable evaluation protocol. We first show that, given an instruction, LLMs can reliably produce high-quality, tailored evaluation checklists. We then show that STICK can be used to improve generation quality across multiple benchmarks via self-refinement and Best-of-N selection.
arXiv Detail & Related papers (2024-10-04T17:09:08Z)
Enhancing Text Classification through LLM-Driven Active Learning and Human Annotation [2.0411082897313984]
This study introduces a novel methodology that integrates human annotators and Large Language Models. The proposed framework integrates human annotation with the output of LLMs, depending on the model uncertainty levels. The empirical results show a substantial decrease in the costs associated with data annotation while either maintaining or improving model accuracy.
arXiv Detail & Related papers (2024-06-17T21:45:48Z)
Large Language Models are Inconsistent and Biased Evaluators [2.136983452580014]
We show that Large Language Models (LLMs) are biased evaluators as they exhibit familiarity bias and show skewed distributions of ratings. We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality.
arXiv Detail & Related papers (2024-05-02T20:42:28Z)
Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs [30.179703001666173]
Factuality issue is a critical concern for Large Language Models (LLMs) We propose GraphEval to evaluate an LLM's performance using a substantially large test dataset. Test dataset is retrieved from a large knowledge graph with more than 10 million facts without expensive human efforts.
arXiv Detail & Related papers (2024-04-01T06:01:17Z)
$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks [21.12437562185667]
This paper presents a new approach for scaling LLM assessment in translating formal syntax to natural language. We use context-free grammars (CFGs) to generate out-of-distribution datasets on the fly. We also conduct an assessment of several SOTA closed and open-source LLMs to showcase the feasibility and scalability of this paradigm.
arXiv Detail & Related papers (2024-03-27T08:08:00Z)
LLatrieval: LLM-Verified Retrieval for Verifiable Generation [67.93134176912477]
Verifiable generation aims to let the large language model (LLM) generate text with supporting documents. We propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question. Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-11-14T01:38:02Z)
CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale. Our empirical study shows CoAnnotating to be an effective means to allocate work from results on different datasets, with up to 21% performance improvement over random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
BooookScore: A systematic exploration of book-length summarization in the era of LLMs [53.42917858142565]
We develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. We find that closed-source LLMs such as GPT-4 and 2 produce summaries with higher BooookScore than those generated by open-source models.
arXiv Detail & Related papers (2023-10-01T20:46:44Z)
Using Large Language Models for Qualitative Analysis can Introduce Serious Bias [0.09208007322096534]
Large Language Models (LLMs) are quickly becoming ubiquitous, but the implications for social science research are not yet well understood. This paper asks whether LLMs can help us analyse large-N qualitative data from open-ended interviews, with an application to transcripts of interviews with Rohingya refugees in Cox's Bazaar, Bangladesh. We find that a great deal of caution is needed in using LLMs to annotate text as there is a risk of introducing biases that can lead to misleading inferences.
arXiv Detail & Related papers (2023-09-29T11:19:15Z)
On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization. We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.