Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons
- URL: http://arxiv.org/abs/2405.05894v2
- Date: Sun, 9 Jun 2024 17:56:11 GMT
- Title: Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons
- Authors: Adian Liusie, Vatsal Raina, Yassir Fathullah, Mark Gales
- Abstract summary: This paper introduces a Product of Experts (PoE) framework for efficient Comparative Assessment.
Individual comparisons are treated as experts that provide information on a pair's score difference.
The PoE framework combines the information from these experts to yield an expression that can be maximized with respect to the scores of the underlying set of candidates.
- Score: 10.94304714004328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM-as-a-judge approaches are a practical and effective way of assessing a range of text tasks, aligning with human judgements especially when applied in a comparative assessment fashion. However, when using pairwise comparisons to rank a set of candidates, the computational cost scales quadratically with the number of candidates, which can be a practical limitation. This paper introduces a Product of Experts (PoE) framework for efficient LLM Comparative Assessment. Here, individual comparisons are considered experts that provide information on a pair's score difference. The PoE framework combines the information from these experts to yield an expression that can be maximized with respect to the underlying set of candidates, and is highly flexible since any form of expert can be assumed. When Gaussian experts are used, one can derive simple closed-form solutions for the optimal candidate ranking, as well as expressions for selecting which comparisons should be made to maximize the probability of this ranking. Our approach enables efficient comparative assessment where, using only a small subset of the possible comparisons, one can generate score predictions that correlate with human judgements as well as predictions made using all comparisons. We evaluate the approach on multiple NLG tasks and demonstrate that our framework can yield considerable computational savings when performing pairwise comparative assessment. When the number of candidates N is large, the PoE solution can achieve performance similar to using all comparisons with as few as 2% of them.
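To make the Gaussian case concrete, here is a minimal sketch based only on the abstract: each comparison k between candidates i_k and j_k is modelled as a Gaussian expert over the score difference, so maximizing the product of experts reduces to a linear least-squares problem. The mapping from an LLM's pairwise probabilities to each expert's mean mu_k, and the handling of expert variances, are simplifying assumptions here rather than the paper's exact choices.

$$
p(\mathbf{s}) \propto \prod_{k} \mathcal{N}\big(s_{i_k} - s_{j_k};\, \mu_k, \sigma^2\big)
\quad\Longrightarrow\quad
\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \sum_{k} \big(s_{i_k} - s_{j_k} - \mu_k\big)^2 = (W^\top W)^{+} W^\top \boldsymbol{\mu},
$$

where row k of $W$ has $+1$ in column $i_k$ and $-1$ in column $j_k$, and $(\cdot)^{+}$ denotes the pseudo-inverse (the scores are identified only up to an additive constant).

```python
import numpy as np

def poe_gaussian_scores(comparisons, n_candidates):
    """Closed-form PoE score estimate under Gaussian experts (sketch).

    comparisons: list of (i, j, mu) triples, where mu is an expert's
    estimate of the score difference s_i - s_j (e.g. mapped from an
    LLM's pairwise win probability; that mapping is an assumption here).
    """
    W = np.zeros((len(comparisons), n_candidates))
    mu = np.zeros(len(comparisons))
    for k, (i, j, m) in enumerate(comparisons):
        W[k, i], W[k, j] = 1.0, -1.0
        mu[k] = m
    # W^T W is singular because scores are identified only up to a
    # constant; the pseudo-inverse selects the minimum-norm solution.
    s_hat = np.linalg.pinv(W.T @ W) @ (W.T @ mu)
    return s_hat - s_hat.mean()  # centre the scores for readability

# Toy usage: 4 candidates, using only 3 of the 6 possible comparisons.
scores = poe_gaussian_scores([(0, 1, 0.8), (1, 2, 0.5), (2, 3, 0.3)], 4)
ranking = np.argsort(-scores)  # highest-scoring candidate first
```

The same least-squares view also hints at comparison selection: the posterior covariance of the scores is proportional to $(W^\top W)^{+}$, so one could greedily choose the next pair whose comparison most reduces ranking uncertainty; the paper derives the corresponding closed-form expressions.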
Related papers
- Efficient Pointwise-Pairwise Learning-to-Rank for News Recommendation [6.979979613916754]
News recommendation is a challenging task that involves personalization based on the interaction history and preferences of each user.
Recent works have leveraged the power of pretrained language models (PLMs) to directly rank news items using inference approaches that predominantly fall into three categories: pointwise, pairwise, and listwise learning-to-rank.
We propose a novel framework for PLM-based news recommendation that integrates both pointwise relevance prediction and pairwise comparisons in a scalable manner.
arXiv Detail & Related papers (2024-09-26T10:27:19Z)
- Finetuning LLMs for Comparative Assessment Tasks [9.05771474043499]
We propose a framework for finetuning large language models for comparative assessment.
By training on soft probabilities, our approach improves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-24T11:21:43Z)
- Compare without Despair: Reliable Preference Evaluation with Generation Separability [20.50638483427141]
We introduce a measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation.
For a candidate test instance, separability samples multiple generations from a pair of models, and measures how distinguishable the two sets of generations are.
Experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters.
arXiv Detail & Related papers (2024-07-02T01:37:56Z)
- Not All Preference Pairs Are Created Equal: A Recipe for Annotation-Efficient Iterative Preference Learning [81.69044784288005]
Iterative preference learning requires online annotated preference labels.
We study strategies to select worth-annotating response pairs for cost-efficient annotation.
arXiv Detail & Related papers (2024-06-25T06:49:16Z)
- The Comparative Trap: Pairwise Comparisons Amplifies Biased Preferences of LLM Evaluators [31.520403357740317]
Large language models (LLMs) are increasingly used as evaluators for natural language generation tasks.
LLMs display biased preferences, such as favoring verbosity and authoritative tones.
We introduce PRePair, which integrates pointwise reasoning within a pairwise framework.
arXiv Detail & Related papers (2024-06-18T06:43:04Z)
- Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences.
We propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs.
We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
- LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models [55.60306377044225]
Large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks.
This paper explores two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment.
For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring.
arXiv Detail & Related papers (2023-07-15T22:02:12Z)
- Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
arXiv Detail & Related papers (2021-10-08T13:51:55Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the one-class nature of the problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach that achieves convergence speed comparable to that of its pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z)
- Ranking a set of objects: a graph based least-square approach [70.7866286425868]
We consider the problem of ranking $N$ objects starting from a set of noisy pairwise comparisons provided by a crowd of equal workers.
We propose a class of non-adaptive ranking algorithms that rely on a least-squares intrinsic optimization criterion for the estimation of qualities.
arXiv Detail & Related papers (2020-02-26T16:19:09Z)