Related papers: LLM-RankFusion: Mitigating Intrinsic Inconsistency in LLM-based Ranking

URL: http://arxiv.org/abs/2406.00231v1
Date: Fri, 31 May 2024 23:29:42 GMT
Title: LLM-RankFusion: Mitigating Intrinsic Inconsistency in LLM-based Ranking
Authors: Yifan Zeng, Ojas Tendolkar, Raymond Baartmans, Qingyun Wu, Huazheng Wang, Lizhong Chen
Abstract summary: Ranking passages by prompting a large language model (LLM) can achieve promising performance in modern information retrieval (IR) systems. We show that sorting-based methods require consistent comparisons to correctly sort the passages, a requirement that LLMs often violate. We propose LLM-RankFusion, an LLM-based ranking framework that mitigates these inconsistencies and produces a robust ranking list.
Score: 17.96316956366718
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Ranking passages by prompting a large language model (LLM) can achieve promising performance in modern information retrieval (IR) systems. A common approach is to sort the ranking list by prompting LLMs for pairwise comparisons. However, sorting-based methods require consistent comparisons to correctly sort the passages, a requirement that we show LLMs often violate. We identify two kinds of intrinsic inconsistency in LLM-based pairwise comparisons: order inconsistency, which leads to conflicting results when the passage order is switched, and transitive inconsistency, which leads to non-transitive triads among all preference pairs. In this paper, we propose LLM-RankFusion, an LLM-based ranking framework that mitigates these inconsistencies and produces a robust ranking list. LLM-RankFusion mitigates order inconsistency by using in-context learning (ICL) to demonstrate order-agnostic comparisons and calibration to estimate the underlying preference probability between two passages. We then address transitive inconsistency by aggregating the ranking results from multiple rankers. In our experiments, we empirically show that LLM-RankFusion can significantly reduce inconsistent pairwise comparison results and improve ranking quality by making the final ranking list more robust.
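
Illustration: the two kinds of inconsistency named in the abstract can be made concrete with a small sketch. This is not the authors' implementation. It assumes a hypothetical matrix raw[i][j] holding the probability that the LLM prefers passage i over passage j when i is shown first, uses a simple average over both presentation orders as a stand-in for calibration, and uses a Borda count as a stand-in for the aggregation step. All function names are made up for this example.

```python
from itertools import combinations


def order_inconsistent_pairs(raw):
    """Pairs whose winner flips when the presentation order is swapped.

    raw[i][j] is assumed to be P(i preferred over j | i shown first).
    The same preference seen from the other order is 1 - raw[j][i].
    """
    n = len(raw)
    return [
        (i, j)
        for i, j in combinations(range(n), 2)
        if (raw[i][j] > 0.5) != (1 - raw[j][i] > 0.5)
    ]


def calibrate(raw):
    """Average both presentation orders so that pref[i][j] + pref[j][i] == 1."""
    n = len(raw)
    pref = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                pref[i][j] = (raw[i][j] + 1 - raw[j][i]) / 2
    return pref


def count_nontransitive_triads(pref):
    """Count triples (a, b, c) whose pairwise winners form a cycle."""
    def beats(i, j):
        return pref[i][j] > 0.5

    n = len(pref)
    cycles = 0
    for a, b, c in combinations(range(n), 3):
        if (beats(a, b) and beats(b, c) and beats(c, a)) or (
            beats(a, c) and beats(c, b) and beats(b, a)
        ):
            cycles += 1
    return cycles


def borda_aggregate(rankings):
    """Fuse several rankings (best first) with a Borda count (assumed method)."""
    n = len(rankings[0])
    score = {item: 0 for item in rankings[0]}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            score[item] += n - 1 - position
    # Higher score first; ties are broken by item id for determinism.
    return sorted(score, key=lambda item: (-score[item], item))


# Hypothetical comparison probabilities for three passages.
raw = [
    [0.5, 0.9, 0.8],
    [0.6, 0.5, 0.7],
    [0.1, 0.2, 0.5],
]
flipped = order_inconsistent_pairs(raw)
pref = calibrate(raw)
cycles = count_nontransitive_triads(pref)
fused = borda_aggregate([[0, 1, 2], [1, 0, 2], [0, 2, 1]])
```

In this toy input, passages 0 and 1 each win when shown first, so that pair is order-inconsistent before averaging. Averaging both orders yields a single preference per pair, and the triad count shows whether those preferences can be sorted without contradiction.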

Related papers

Likert or Not: LLM Absolute Relevance Judgments on Fine-Grained Ordinal Scales [3.4068099825211986]
The two most common prompts to elicit relevance judgments are pointwise scoring and listwise ranking. The current research community consensus is that listwise ranking yields superior performance. In tension with this hypothesis, we find that the gap between pointwise scoring and listwise ranking shrinks when pointwise scoring is implemented using a sufficiently large ordinal relevance label space.
arXiv Detail & Related papers (2025-05-25T21:41:35Z)
CoRanking: Collaborative Ranking with Small and Large Ranking Agents [39.98101653077503]
Large Language Models (LLMs) have demonstrated superior listwise ranking performance. CoRanking combines small and large ranking models for efficient and effective ranking.
arXiv Detail & Related papers (2025-03-30T13:00:52Z)
Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat [7.8905223445925055]
Pairwise ranking has emerged as a new method for evaluating human preferences for large language models (LLMs). We explore the effectiveness of ranking systems for head-to-head comparisons of LLMs. Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency.
arXiv Detail & Related papers (2024-11-19T20:16:26Z)
TSPRank: Bridging Pairwise and Listwise Methods with a Bilinear Travelling Salesman Model [19.7255072094322]
Travelling Salesman Problem Rank (TSPRank) is a hybrid pairwise-listwise ranking method. TSPRank's robustness and superior performance across different domains highlight its potential as a versatile and effective LETOR solution.
arXiv Detail & Related papers (2024-11-18T21:10:14Z)
Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach. This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets. We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
arXiv Detail & Related papers (2024-11-07T10:31:31Z)
FIRST: Faster Improved Listwise Reranking with Single Token Decoding [56.727761901751194]
First, we introduce FIRST, a novel listwise LLM reranking approach leveraging the output logits of the first generated identifier to directly obtain a ranked ordering of the candidates. Empirical results demonstrate that FIRST accelerates inference by 50% while maintaining a robust ranking performance with gains across the BEIR benchmark. Our results show that LLM rerankers can provide a stronger distillation signal compared to cross-encoders, yielding substantial improvements in retriever recall after relevance feedback.
arXiv Detail & Related papers (2024-06-21T21:27:50Z)
An Investigation of Prompt Variations for Zero-shot LLM-based Rankers [28.435970994243615]
We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs). It is currently unclear whether performance differences are due to the underlying ranking algorithm or to spurious factors such as a better choice of words in prompts.
arXiv Detail & Related papers (2024-06-20T09:03:18Z)
LiPO: Listwise Preference Optimization through Learning-to-Rank [62.02782819559389]
The policy can learn more effectively from a ranked list of plausible responses given the prompt. We show that LiPO-$\lambda$ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks.
arXiv Detail & Related papers (2024-02-02T20:08:10Z)
Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
Tuna: Instruction Tuning using Feedback from Large Language Models [74.04950416204551]
We propose finetuning an instruction-tuned large language model using our novel probabilistic ranking and contextual ranking approaches. Probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher LLM. On the other hand, learning with contextual ranking allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs.
arXiv Detail & Related papers (2023-10-20T09:55:06Z)
Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models [63.714662435555674]
Large language models (LLMs) exhibit positional bias in how they use context. We propose permutation self-consistency, a form of self-consistency over ranking list outputs of black-box LLMs. Our approach improves scores from conventional inference by up to 7-18% for GPT-3.5 and 8-16% for LLaMA v2 (70B).
arXiv Detail & Related papers (2023-10-11T17:59:02Z)
Unsupervised Contrast-Consistent Ranking with Language Models [24.696017700382665]
Language models contain ranking-based knowledge and are powerful solvers of in-context ranking tasks. We compare pairwise, pointwise and listwise prompting techniques to elicit a language model's ranking knowledge. We find that even with careful calibration and constrained decoding, prompting-based techniques may not always be self-consistent in the rankings they produce.
arXiv Detail & Related papers (2023-09-13T14:36:26Z)
Bipartite Ranking Fairness through a Model Agnostic Ordering Adjustment [54.179859639868646]
We propose a model agnostic post-processing framework xOrder for achieving fairness in bipartite ranking. xOrder is compatible with various classification models and ranking fairness metrics, including supervised and unsupervised fairness metrics. We evaluate our proposed algorithm on four benchmark data sets and two real-world patient electronic health record repositories.
arXiv Detail & Related papers (2023-07-27T07:42:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.