Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons
- URL: http://arxiv.org/abs/2405.05894v2
- Date: Sun, 9 Jun 2024 17:56:11 GMT
- Title: Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons
- Authors: Adian Liusie, Vatsal Raina, Yassir Fathullah, Mark Gales
- Abstract summary: This paper introduces a Product of Experts (PoE) framework for efficient Comparative Assessment.
Individual comparisons are treated as experts that provide information on a pair's score difference.
The PoE framework combines the information from these experts to yield an expression that can be maximized with respect to the scores of the underlying set of candidates.
- Score: 10.94304714004328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM-as-a-judge approaches are a practical and effective way of assessing a range of text tasks, aligning with human judgements especially when applied in a comparative assessment fashion. However, when using pairwise comparisons to rank a set of candidates, the computational cost scales quadratically with the number of candidates, which can be a practical limitation. This paper introduces a Product of Experts (PoE) framework for efficient LLM Comparative Assessment. Here, individual comparisons are considered experts that provide information on a pair's score difference. The PoE framework combines the information from these experts to yield an expression that can be maximized with respect to the underlying set of candidates, and is highly flexible since any form of expert can be assumed. When Gaussian experts are used, one can derive simple closed-form solutions for the optimal candidate ranking, as well as expressions for selecting which comparisons should be made to maximize the probability of this ranking. Our approach enables efficient comparative assessment where, using only a small subset of the possible comparisons, one can generate score predictions that correlate with human judgements as well as predictions made using all comparisons. We evaluate the approach on multiple NLG tasks and demonstrate that our framework can yield considerable computational savings when performing pairwise comparative assessment. When the number of candidates N is large, the PoE solution can achieve performance similar to using all comparisons with as few as 2% of them.
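To make the Gaussian case concrete, here is a minimal sketch based only on the abstract: each comparison k between candidates i_k and j_k is modelled as a Gaussian expert over the score difference, so maximizing the product of experts reduces to a linear least-squares problem. The mapping from an LLM's pairwise probabilities to each expert's mean mu_k, and the handling of expert variances, are simplifying assumptions here rather than the paper's exact choices.

$$
p(\mathbf{s}) \propto \prod_{k} \mathcal{N}\big(s_{i_k} - s_{j_k};\, \mu_k, \sigma^2\big)
\quad\Longrightarrow\quad
\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \sum_{k} \big(s_{i_k} - s_{j_k} - \mu_k\big)^2 = (W^\top W)^{+} W^\top \boldsymbol{\mu},
$$

where row k of $W$ has $+1$ in column $i_k$ and $-1$ in column $j_k$, and $(\cdot)^{+}$ denotes the pseudo-inverse (the scores are identified only up to an additive constant).

```python
import numpy as np

def poe_gaussian_scores(comparisons, n_candidates):
    """Closed-form PoE score estimate under Gaussian experts (sketch).

    comparisons: list of (i, j, mu) triples, where mu is an expert's
    estimate of the score difference s_i - s_j (e.g. mapped from an
    LLM's pairwise win probability; that mapping is an assumption here).
    """
    W = np.zeros((len(comparisons), n_candidates))
    mu = np.zeros(len(comparisons))
    for k, (i, j, m) in enumerate(comparisons):
        W[k, i], W[k, j] = 1.0, -1.0
        mu[k] = m
    # W^T W is singular because scores are identified only up to a
    # constant; the pseudo-inverse selects the minimum-norm solution.
    s_hat = np.linalg.pinv(W.T @ W) @ (W.T @ mu)
    return s_hat - s_hat.mean()  # centre the scores for readability

# Toy usage: 4 candidates, using only 3 of the 6 possible comparisons.
scores = poe_gaussian_scores([(0, 1, 0.8), (1, 2, 0.5), (2, 3, 0.3)], 4)
ranking = np.argsort(-scores)  # highest-scoring candidate first
```

The same least-squares view also hints at comparison selection: the posterior covariance of the scores is proportional to $(W^\top W)^{+}$, so one could greedily choose the next pair whose comparison most reduces ranking uncertainty; the paper derives the corresponding closed-form expressions.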
Related papers
- Efficient Pointwise-Pairwise Learning-to-Rank for News Recommendation [6.979979613916754]
News recommendation is a challenging task that involves personalization based on the interaction history and preferences of each user.
Recent works have leveraged the power of pretrained language models (PLMs) to directly rank news items using inference approaches that predominantly fall into three categories: pointwise, pairwise, and listwise learning-to-rank.
We propose a novel framework for PLM-based news recommendation that integrates both pointwise relevance prediction and pairwise comparisons in a scalable manner.
arXiv Detail & Related papers (2024-09-26T10:27:19Z)
- Finetuning LLMs for Comparative Assessment Tasks [9.05771474043499]
We propose a framework for finetuning large language models for comparative assessment.
By training on soft probabilities, our approach improves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-24T11:21:43Z)
- Compare without Despair: Reliable Preference Evaluation with Generation Separability [20.50638483427141]
We introduce a measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation.
For a candidate test instance, separability samples multiple generations from a pair of models, and measures how distinguishable the two sets of generations are.
Experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters.
arXiv Detail & Related papers (2024-07-02T01:37:56Z)
- Not All Preference Pairs Are Created Equal: A Recipe for Annotation-Efficient Iterative Preference Learning [81.69044784288005]
Iterative preference learning requires online annotated preference labels.
We study strategies to select worth-annotating response pairs for cost-efficient annotation.
arXiv Detail & Related papers (2024-06-25T06:49:16Z)
- The Comparative Trap: Pairwise Comparisons Amplifies Biased Preferences of LLM Evaluators [31.520403357740317]
Large language models (LLMs) are increasingly used as evaluators for natural language generation tasks.
LLMs display biased preferences, such as favoring verbosity and authoritative tones.
We introduce PRePair, which integrates pointwise reasoning within a pairwise framework.
arXiv Detail & Related papers (2024-06-18T06:43:04Z)
- Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences.
We propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs.
We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
- LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models [55.60306377044225]
Large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks.
This paper explores two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment.
For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring.
arXiv Detail & Related papers (2023-07-15T22:02:12Z)
- Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
arXiv Detail & Related papers (2021-10-08T13:51:55Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the one-class nature of the problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach that achieves convergence speed comparable to that of its pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z)
- Ranking a set of objects: a graph based least-square approach [70.7866286425868]
We consider the problem of ranking $N$ objects starting from a set of noisy pairwise comparisons provided by a crowd of equal workers.
We propose a class of non-adaptive ranking algorithms that rely on a least-squares intrinsic optimization criterion for the estimation of qualities.
arXiv Detail & Related papers (2020-02-26T16:19:09Z)