LLM-RankFusion: Mitigating Intrinsic Inconsistency in LLM-based Ranking
- URL: http://arxiv.org/abs/2406.00231v1
- Date: Fri, 31 May 2024 23:29:42 GMT
- Title: LLM-RankFusion: Mitigating Intrinsic Inconsistency in LLM-based Ranking
- Authors: Yifan Zeng, Ojas Tendolkar, Raymond Baartmans, Qingyun Wu, Huazheng Wang, Lizhong Chen
- Abstract summary: Ranking passages by prompting a large language model (LLM) can achieve promising performance in modern information retrieval (IR) systems.
We show that sorting-based methods require consistent comparisons to sort the passages correctly, a requirement that LLMs often violate.
We propose LLM-RankFusion, an LLM-based ranking framework that mitigates these inconsistencies and produces a robust ranking list.
- Score: 17.96316956366718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ranking passages by prompting a large language model (LLM) can achieve promising performance in modern information retrieval (IR) systems. A common approach is to sort the ranking list by prompting LLMs for pairwise comparison. However, sorting-based methods require consistent comparisons to sort the passages correctly, a requirement that we show LLMs often violate. We identify two kinds of intrinsic inconsistency in LLM-based pairwise comparisons: order inconsistency, which leads to conflicting results when switching the passage order, and transitive inconsistency, which leads to non-transitive triads among all preference pairs. In this paper, we propose LLM-RankFusion, an LLM-based ranking framework that mitigates these inconsistencies and produces a robust ranking list. LLM-RankFusion mitigates order inconsistency using in-context learning (ICL) to demonstrate order-agnostic comparisons and calibration to estimate the underlying preference probability between two passages. We then address transitive inconsistency by aggregating the ranking results from multiple rankers. In our experiments, we empirically show that LLM-RankFusion can significantly reduce inconsistent pairwise comparison results and improve the ranking quality by making the final ranking list more robust.
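The two inconsistencies named in the abstract, and the two mitigation ideas (calibration and aggregation), can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, calibration is approximated by averaging the preference probability over both passage orders, and a Borda count stands in for the aggregation step, whose exact method the abstract does not specify.

```python
from itertools import combinations


def calibrated_preference(p_a_shown_first, p_a_shown_second):
    """Average P(a preferred over b) across both presentation orders.

    Assumption: a simple mean is used here; the abstract only says that
    calibration estimates the underlying preference probability.
    """
    return 0.5 * (p_a_shown_first + p_a_shown_second)


def is_order_inconsistent(p_a_shown_first, p_a_shown_second):
    """True when the hard decision flips after swapping the passage order."""
    return (p_a_shown_first > 0.5) != (p_a_shown_second > 0.5)


def count_nontransitive_triads(items, prefers):
    """Count triads {a, b, c} whose pairwise preferences form a cycle.

    prefers(x, y) is a hypothetical callback returning True when x is
    preferred over y.
    """
    cycles = 0
    for a, b, c in combinations(items, 3):
        forward = prefers(a, b) and prefers(b, c) and prefers(c, a)
        backward = prefers(b, a) and prefers(c, b) and prefers(a, c)
        if forward or backward:
            cycles += 1
    return cycles


def borda_aggregate(rankings):
    """Fuse several full rankings (best first) with a Borda count.

    Assumption: Borda is a stand-in for "aggregating the ranking results
    from multiple rankers".
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - 1 - position)
    # Highest score first; ties broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

For example, a comparator that prefers a over b, b over c, and c over a yields one non-transitive triad, so no sorting algorithm can order those three passages consistently with every comparison. Fusing the lists produced by several rankers is one way to reduce the effect of such cycles on the final ranking.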
Related papers
- An Investigation of Prompt Variations for Zero-shot LLM-based Rankers [28.435970994243615]
We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs).
It is currently unclear whether performance differences are due to the underlying ranking algorithm, or because of spurious factors such as better choice of words used in prompts.
arXiv Detail & Related papers (2024-06-20T09:03:18Z)
- Make Large Language Model a Better Ranker [20.532118635672763]
This paper introduces the large language model framework with Aligned Listwise Ranking Objectives (ALRO).
ALRO is designed to bridge the gap between the capabilities of LLMs and nuanced requirements of ranking tasks.
Our evaluative studies reveal that ALRO outperforms both existing embedding-based recommendation methods and LLM-based recommendation baselines.
arXiv Detail & Related papers (2024-03-28T07:22:16Z)
- LiPO: Listwise Preference Optimization through Learning-to-Rank [62.02782819559389]
Policy can learn more effectively from a ranked list of plausible responses given the prompt.
We show that LiPO-$\lambda$ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks.
arXiv Detail & Related papers (2024-02-02T20:08:10Z)
- The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning [61.68787689234622]
A recent study, LIMA, shows that using merely 1K examples for alignment tuning can achieve significant alignment performance as well.
This raises questions about how exactly the alignment tuning transforms a base LLM.
We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting.
arXiv Detail & Related papers (2023-12-04T00:46:11Z)
- Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
- Tuna: Instruction Tuning using Feedback from Large Language Models [74.04950416204551]
We propose finetuning an instruction-tuned large language model using our novel probabilistic ranking and contextual ranking approaches.
Probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher LLM.
On the other hand, learning with contextual ranking allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs.
arXiv Detail & Related papers (2023-10-20T09:55:06Z)
- Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models [63.714662435555674]
Large language models (LLMs) exhibit positional bias in how they use context.
We propose permutation self-consistency, a form of self-consistency over ranking list outputs of black-box LLMs.
Our approach improves scores from conventional inference by up to 7-18% for GPT-3.5 and 8-16% for LLaMA v2 (70B).
arXiv Detail & Related papers (2023-10-11T17:59:02Z)
- Unsupervised Contrast-Consistent Ranking with Language Models [24.696017700382665]
Language models contain ranking-based knowledge and are powerful solvers of in-context ranking tasks.
We compare pairwise, pointwise and listwise prompting techniques to elicit a language model's ranking knowledge.
We find that even with careful calibration and constrained decoding, prompting-based techniques may not always be self-consistent in the rankings they produce.
arXiv Detail & Related papers (2023-09-13T14:36:26Z)
- Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting [65.00288634420812]
Pairwise Ranking Prompting (PRP) is a technique to significantly reduce the burden on Large Language Models (LLMs).
Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs.
arXiv Detail & Related papers (2023-06-30T11:32:25Z)
- LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion [33.73671362609599]
Our framework consists of two modules: PairRanker and GenFuser.
PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs.
GenFuser aims to merge the top-ranked candidates, generating an improved output.
arXiv Detail & Related papers (2023-06-05T03:32:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.