An Investigation of Prompt Variations for Zero-shot LLM-based Rankers
- URL: http://arxiv.org/abs/2406.14117v1
- Date: Thu, 20 Jun 2024 09:03:18 GMT
- Title: An Investigation of Prompt Variations for Zero-shot LLM-based Rankers
- Authors: Shuoqi Sun, Shengyao Zhuang, Shuai Wang, Guido Zuccon
- Abstract summary: We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs).
It is currently unclear whether performance differences are due to the underlying ranking algorithm, or because of spurious factors such as better choice of words used in prompts.
- Score: 28.435970994243615
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs). Several zero-shot ranking methods based on LLMs have recently been proposed. Among many aspects, methods differ across (1) the ranking algorithm they implement, e.g., pointwise vs. listwise, (2) the backbone LLMs used, e.g., GPT3.5 vs. FLAN-T5, (3) the components and wording used in prompts, e.g., the use or not of role-definition (role-playing) and the actual words used to express this. It is currently unclear whether performance differences are due to the underlying ranking algorithm, or because of spurious factors such as better choice of words used in prompts. This confusion risks undermining future research. Through our large-scale experimentation and analysis, we find that ranking algorithms do contribute to differences between methods for zero-shot LLM ranking. However, so do the LLM backbones -- but even more importantly, the choice of prompt components and wordings affect the ranking. In fact, in our experiments, we find that, at times, these latter elements have more impact on the ranker's effectiveness than the actual ranking algorithms, and that differences among ranking methods become more blurred when prompt variations are considered.
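To make the prompt-component dimension concrete, below is a minimal Python sketch, not the paper's actual templates: all wordings and function names are illustrative assumptions. It shows how a role-definition preamble can be toggled independently of the ranking algorithm, here pointwise vs. listwise prompting.

```python
# Illustrative prompt construction for zero-shot LLM rankers.
# ROLE and all wordings below are assumptions for illustration only.

ROLE = "You are an intelligent assistant that ranks passages by their relevance to a query.\n"

def pointwise_prompt(query: str, passage: str, use_role: bool = True) -> str:
    """Judge a single query-passage pair (pointwise ranking)."""
    role = ROLE if use_role else ""
    return (
        f"{role}"
        f"Passage: {passage}\n"
        f"Query: {query}\n"
        "Is this passage relevant to the query? Answer Yes or No."
    )

def listwise_prompt(query: str, passages: list[str], use_role: bool = True) -> str:
    """Order all candidate passages in one call (listwise ranking)."""
    role = ROLE if use_role else ""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"{role}"
        f"Query: {query}\n"
        f"{numbered}\n"
        "Rank the passages above from most to least relevant, e.g. [2] > [1] > [3]."
    )

if __name__ == "__main__":
    # Same ranking algorithm, two prompt variants: with and without role-definition.
    print(pointwise_prompt("effect of caffeine on sleep", "Caffeine delays sleep onset.", use_role=True))
    print(pointwise_prompt("effect of caffeine on sleep", "Caffeine delays sleep onset.", use_role=False))
```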
Related papers
- Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat [7.8905223445925055]
Pairwise ranking has emerged as a new method for evaluating human preferences for large language models (LLMs).
We explore the effectiveness of ranking systems for head-to-head comparisons of LLMs.
Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency.
arXiv Detail & Related papers (2024-11-19T20:16:26Z)
- LLM-RankFusion: Mitigating Intrinsic Inconsistency in LLM-based Ranking [17.96316956366718]
Ranking passages by prompting a large language model (LLM) can achieve promising performance in modern information retrieval (IR) systems.
We show that sorting-based methods require consistent comparisons to correctly sort the passages, a requirement that LLMs often violate.
We propose LLM-RankFusion, an LLM-based ranking framework that mitigates these inconsistencies and produces a robust ranking list.
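A minimal sketch of the inconsistency in question, assuming `llm` is any text-completion callable (the prompt wording is illustrative, not LLM-RankFusion's actual template): a sorting-based ranker needs the preferred passage not to flip when the two passages swap positions in the prompt.

```python
from typing import Callable

def prefers_first(llm: Callable[[str], str], query: str, a: str, b: str) -> bool:
    """Ask the LLM which of two passages is more relevant; True if the first wins."""
    prompt = (
        f"Query: {query}\n"
        f"Passage A: {a}\n"
        f"Passage B: {b}\n"
        "Which passage is more relevant to the query? Answer A or B."
    )
    return llm(prompt).strip().upper().startswith("A")

def is_order_consistent(llm: Callable[[str], str], query: str, a: str, b: str) -> bool:
    """Consistent iff the preferred passage does not change when A and B swap positions."""
    return prefers_first(llm, query, a, b) != prefers_first(llm, query, b, a)
```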
arXiv Detail & Related papers (2024-05-31T23:29:42Z)
- Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named 'Rephrase and Respond' (RaR), which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
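A minimal sketch of the two-step rephrase-then-respond idea, assuming `llm` is any text-completion callable; the instruction wording is an illustrative assumption, not taken from the paper.

```python
from typing import Callable

def rephrase_and_respond(llm: Callable[[str], str], question: str) -> str:
    """Two-step RaR-style prompting: rephrase the question, then answer the rephrased version."""
    # Step 1: let the model rephrase and expand the question in its own words.
    rephrased = llm(
        "Rephrase and expand the following question so that it is unambiguous:\n" + question
    )
    # Step 2: answer the rephrased question instead of the original one.
    return llm(rephrased + "\nAnswer the question above.")
```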
arXiv Detail & Related papers (2023-11-07T18:43:34Z)
- Instruction Distillation Makes Large Language Models Efficient Zero-shot Rankers [56.12593882838412]
We introduce a novel instruction distillation method to rank documents.
We first rank documents using the effective pairwise approach with complex instructions, and then distill the teacher predictions to the pointwise approach with simpler instructions.
Our approach surpasses the performance of existing supervised methods like monoT5 and is on par with the state-of-the-art zero-shot methods.
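A minimal sketch of the distillation idea under stated assumptions: teacher orderings obtained with expensive pairwise prompting supervise a cheaper pointwise student through a RankNet-style pairwise loss over the student's scores. The loss choice and names are illustrative, not the paper's exact objective.

```python
import math

def ranknet_distill_loss(student_scores: list[float], teacher_order: list[int]) -> float:
    """Pairwise distillation loss; teacher_order[i] is the index of the i-th best document."""
    loss, pairs = 0.0, 0
    for i in range(len(teacher_order)):
        for j in range(i + 1, len(teacher_order)):
            better, worse = teacher_order[i], teacher_order[j]
            margin = student_scores[better] - student_scores[worse]
            loss += math.log(1.0 + math.exp(-margin))  # penalise disagreement with the teacher
            pairs += 1
    return loss / max(pairs, 1)

# Toy example: the teacher ranks document 1 above 2 above 0.
print(ranknet_distill_loss([0.2, 0.9, 0.5], teacher_order=[1, 2, 0]))
```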
arXiv Detail & Related papers (2023-11-02T19:16:21Z)
- A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models [35.17291316942284]
We propose a novel zero-shot document ranking approach based on Large Language Models (LLMs): the Setwise prompting approach.
Our approach complements existing prompting approaches for LLM-based zero-shot ranking: Pointwise, Pairwise, and Listwise.
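A minimal sketch of a setwise-style prompt, assuming `llm` is any text-completion callable and the wording is illustrative rather than the paper's exact template: instead of scoring one passage (pointwise) or comparing two (pairwise), the LLM picks the most relevant passage out of a small set.

```python
from typing import Callable

def pick_best(llm: Callable[[str], str], query: str, passages: list[str]) -> int:
    """Return the index of the passage the LLM judges most relevant to the query."""
    labels = "ABCDEFGH"[: len(passages)]
    options = "\n".join(f"{label}) {p}" for label, p in zip(labels, passages))
    prompt = (
        f"Query: {query}\n{options}\n"
        "Which passage is the most relevant to the query? Answer with a single letter."
    )
    answer = llm(prompt).strip().upper()[:1]
    return labels.index(answer) if answer in labels else 0
```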
arXiv Detail & Related papers (2023-10-14T05:20:02Z)
- Replace Scoring with Arrangement: A Contextual Set-to-Arrangement Framework for Learning-to-Rank [40.81502990315285]
Learning-to-rank is a core technique in the top-N recommendation task, where an ideal ranker would be a mapping from an item set to an arrangement.
Most existing solutions fall within the paradigm of the probabilistic ranking principle (PRP), i.e., first score each item in the candidate set and then perform a sort operation to generate the top ranking list.
We propose Set-To-Arrangement Ranking (STARank), a new framework that directly generates permutations of the candidate items without the need for individual scoring and sorting operations.
arXiv Detail & Related papers (2023-08-05T12:22:26Z)
- Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting [65.00288634420812]
Pairwise Ranking Prompting (PRP) is a technique that significantly reduces the burden on Large Language Models (LLMs).
Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs.
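A minimal sketch of how pairwise judgements can be aggregated into a ranking, assuming a `compare(query, a, b)` callable such as the pairwise prompt sketched earlier; the single sliding-window (bubble-style) pass shown here is one common aggregation choice, not necessarily the paper's.

```python
from typing import Callable

def sliding_window_pass(
    compare: Callable[[str, str, str], bool],  # compare(query, a, b) -> True if a is more relevant
    query: str,
    passages: list[str],
) -> list[str]:
    """One bubble-style pass: the winning passage is promoted toward the top of the list."""
    ranked = list(passages)
    for i in range(len(ranked) - 1, 0, -1):
        if compare(query, ranked[i], ranked[i - 1]):
            ranked[i - 1], ranked[i] = ranked[i], ranked[i - 1]
    return ranked
```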
arXiv Detail & Related papers (2023-06-30T11:32:25Z)
- RankCSE: Unsupervised Sentence Representations Learning via Learning to Rank [54.854714257687334]
We propose a novel approach, RankCSE, for unsupervised sentence representation learning.
It incorporates ranking consistency and ranking distillation with contrastive learning into a unified framework.
An extensive set of experiments is conducted on both semantic textual similarity (STS) and transfer (TR) tasks.
arXiv Detail & Related papers (2023-05-26T08:27:07Z)
- Learning List-Level Domain-Invariant Representations for Ranking [59.3544317373004]
We propose list-level alignment -- learning domain-invariant representations at the higher level of lists.
One key benefit is that it leads to the first domain adaptation generalization bound for ranking, in turn providing theoretical support for the proposed method.
arXiv Detail & Related papers (2022-12-21T04:49:55Z)
- Which Tricks Are Important for Learning to Rank? [32.38701971636441]
State-of-the-art learning-to-rank methods are based on gradient-boosted decision trees (GBDT).
In this paper, we thoroughly analyze several GBDT-based ranking algorithms in a unified setup.
As a result, we gain insights into learning-to-rank techniques and obtain a new state-of-the-art algorithm.
arXiv Detail & Related papers (2022-04-04T13:59:04Z)
- Online Learning of Optimally Diverse Rankings [63.62764375279861]
We propose an algorithm that efficiently learns the optimal list based on users' feedback only.
We show that after $T$ queries, the regret of LDR scales as $O((N-L)\log(T))$, where $N$ is the total number of items.
arXiv Detail & Related papers (2021-09-13T12:13:20Z)