Related papers: Best in Tau@LLMJudge: Criteria-Based Relevance Evaluation with Llama3

Best in Tau@LLMJudge: Criteria-Based Relevance Evaluation with Llama3

URL: http://arxiv.org/abs/2410.14044v1
Date: Thu, 17 Oct 2024 21:37:08 GMT
Title: Best in Tau@LLMJudge: Criteria-Based Relevance Evaluation with Llama3
Authors: Naghmeh Farzi, Laura Dietz,
Abstract summary: We explore alternative methods to prompt large language models (LLMs) for assigned relevance labels. We consider various ways to aggregate criteria-level grades into a relevance label. We include an empirical evaluation of our approaches based on data from the LLMJudge challenge run in Summer 2024.
Score: 5.478764356647438
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Traditional evaluation of information retrieval (IR) systems relies on human-annotated relevance labels, which can be both biased and costly at scale. In this context, large language models (LLMs) offer an alternative by allowing us to directly prompt them to assign relevance labels for passages associated with each query. In this study, we explore alternative methods to directly prompt LLMs for assigned relevance labels, by exploring two hypotheses: Hypothesis 1 assumes that it is helpful to break down "relevance" into specific criteria - exactness, coverage, topicality, and contextual fit. We explore different approaches that prompt large language models (LLMs) to obtain criteria-level grades for all passages, and we consider various ways to aggregate criteria-level grades into a relevance label. Hypothesis 2 assumes that differences in linguistic style between queries and passages may negatively impact the automatic relevance label prediction. We explore whether improvements can be achieved by first synthesizing a summary of the passage in the linguistic style of a query, and then using this summary in place of the passage to assess its relevance. We include an empirical evaluation of our approaches based on data from the LLMJudge challenge run in Summer 2024, where our "Four Prompts" approach obtained the highest scores in Kendall's tau.

Related papers

Benchmarking LLM-based Relevance Judgment Methods [15.255877686845773]
Large Language Models (LLMs) are increasingly deployed in both academic and industry settings. We systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model.
arXiv Detail & Related papers (2025-04-17T01:13:21Z)
GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction [12.172254885579706]
Graded entity salience assigns entities scores that reflect their relative importance in a text. We introduce a novel approach for graded entity salience that combines the strengths of both approaches. Our approach shows stronger correlation with scores based on human summaries and alignments, and outperforms existing techniques.
arXiv Detail & Related papers (2025-04-15T01:26:14Z)
Mitigating Boundary Ambiguity and Inherent Bias for Text Classification in the Era of Large Language Models [24.085614720512744]
This study shows that large language models (LLMs) are vulnerable to changes in the number and arrangement of options in text classification. Key bottleneck arises from ambiguous decision boundaries and inherent biases towards specific tokens and positions. Our approach is grounded in the empirical observation that pairwise comparisons can effectively alleviate boundary ambiguity and inherent bias.
arXiv Detail & Related papers (2024-06-11T06:53:19Z)
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models [1.1965844936801802]
The purpose of this paper is to identify which specific terms in prompts positively or negatively impact relevance evaluation with Large Language Models. By comparing the performance of these prompts in both few-shot and zero-shot settings, we analyze the influence of specific terms in the prompts.
arXiv Detail & Related papers (2024-05-11T06:30:13Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs) We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing. We then unify the literature by proposing three intuitive, two for bias evaluation, and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs) We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
GaussianMLR: Learning Implicit Class Significance via Calibrated Multi-Label Ranking [0.0]
We propose a novel multi-label ranking method: GaussianMLR. It aims to learn implicit class significance values that determine the positive label ranks. We show that our method is able to accurately learn a representation of the incorporated positive rank order.
arXiv Detail & Related papers (2023-03-07T14:09:08Z)
Task-Specific Embeddings for Ante-Hoc Explainable Text Classification [6.671252951387647]
We propose an alternative training objective in which we learn task-specific embeddings of text. Our proposed objective learns embeddings such that all texts that share the same target class label should be close together. We present extensive experiments which show that the benefits of ante-hoc explainability and incremental learning come at no cost in overall classification accuracy.
arXiv Detail & Related papers (2022-11-30T19:56:25Z)
Towards Human-Centred Explainability Benchmarks For Text Classification [4.393754160527062]
We propose to extend text classification benchmarks to evaluate the explainability of text classifiers. We review challenges associated with objectively evaluating the capabilities to produce valid explanations. We propose to ground these benchmarks in human-centred applications.
arXiv Detail & Related papers (2022-11-10T09:52:31Z)
UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query. Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms. We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
Distant finetuning with discourse relations for stance classification [55.131676584455306]
We propose a new method to extract data with silver labels from raw text to finetune a model for stance classification. We also propose a 3-stage training framework where the noisy level in the data used for finetuning decreases over different stages. Our approach ranks 1st among 26 competing teams in the stance classification track of the NLPCC 2021 shared task Argumentative Text Understanding for AI Debater.
arXiv Detail & Related papers (2022-04-27T04:24:35Z)
Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis. We learn sentiment, aspect> joint topic embeddings in the word embedding space. We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.