Related papers: LiPO: Listwise Preference Optimization through Learning-to-Rank

LiPO: Listwise Preference Optimization through Learning-to-Rank

URL: http://arxiv.org/abs/2402.01878v2
Date: Wed, 22 May 2024 18:51:02 GMT
Title: LiPO: Listwise Preference Optimization through Learning-to-Rank
Authors: Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, Xuanhui Wang,
Abstract summary: Policy can learn more effectively from a ranked list of plausible responses given the prompt. We show that LiPO-$lambda$ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks.
Score: 62.02782819559389
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in a format of a ranked list over multiple responses to amortize the cost of reading prompt. Multiple responses can also be ranked by reward models or AI feedback. There lacks such a thorough study on directly fitting upon a list of responses. In this work, we formulate the LM alignment as a \textit{listwise} ranking problem and describe the LiPO framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives. Following this connection, we provide an examination of ranking objectives that are not well studied for LM alignment with DPO and SLiC as special cases when list size is two. In particular, we highlight a specific method, LiPO-$\lambda$, which leverages a state-of-the-art \textit{listwise} ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-$\lambda$ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks with both curated and real rankwise preference data.

Related papers

Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning [70.6126069527741]
ConvRec-R1 is a two-stage framework for end-to-end training of conversational recommender systems.<n>In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline.<n>In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization.
arXiv Detail & Related papers (2025-10-23T02:56:00Z)
Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations [0.0]
Large language models (LLMs) have been adopted for prompt-based recommendation.<n>LLMs face limitations such as limited context window size, inefficient pointwise and pairwise prompting, and difficulty handling listwise ranking.<n>We propose a hybrid framework that combines a traditional recommendation model with an LLM for reranking top-k items using structured prompts.
arXiv Detail & Related papers (2025-05-08T05:01:44Z)
In-context Ranking Preference Optimization [48.36442791241395]
We propose an In-context Ranking Preference Optimization (IRPO) framework to optimize large language models (LLMs) based on ranking lists constructed during inference. We show IRPO outperforms standard DPO approaches in ranking performance, highlighting its effectiveness in aligning LLMs with direct in-context ranking preferences.
arXiv Detail & Related papers (2025-04-21T23:06:12Z)
Active Learning for Direct Preference Optimization [59.84525302418018]
Direct preference optimization (DPO) is a form of reinforcement learning from human feedback. We propose an active learning framework for DPO, which can be applied to collect human feedback online or to choose the most informative subset of already collected feedback offline.
arXiv Detail & Related papers (2025-03-03T00:36:31Z)
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback [40.01227095901647]
Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. We introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly.
arXiv Detail & Related papers (2025-01-22T14:15:46Z)
Online Preference Alignment for Language Models via Count-based Exploration [46.46627519343809]
Reinforcement Learning from Human Feedback (RLHF) has shown great potential in fine-tuning Large Language Models (LLMs) to align with human preferences. Existing methods perform preference alignment from a fixed dataset, which can be limited in data coverage. Online RLHF is more desirable to empower the LLM to explore outside the support of the initial dataset by iteratively collecting the prompt-response pairs.
arXiv Detail & Related papers (2025-01-22T09:12:09Z)
Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach. This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets. We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
arXiv Detail & Related papers (2024-11-07T10:31:31Z)
Permutative Preference Alignment from Listwise Ranking of Human Judgments [40.23480751285947]
We develop an end-to-end alignment algorithm by approximating NDCG with a differentiable surrogate loss.<n>We show that NDCG-based approaches improve ranking accuracy more effectively than B-T-based methods.
arXiv Detail & Related papers (2024-10-06T03:49:28Z)
Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences. We propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs. We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences. Despite its popularity, (fixed) reward models may suffer from inaccurate off-distribution. We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
Make Large Language Model a Better Ranker [20.532118635672763]
This paper introduces the large language model framework with Aligned Listwise Ranking Objectives (ALRO) ALRO is designed to bridge the gap between the capabilities of LLMs and nuanced requirements of ranking tasks. Our evaluative studies reveal that ALRO outperforms both existing embedding-based recommendation methods and LLM-based recommendation baselines.
arXiv Detail & Related papers (2024-03-28T07:22:16Z)
Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts. RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
Tuna: Instruction Tuning using Feedback from Large Language Models [74.04950416204551]
We propose finetuning an instruction-tuned large language model using our novel textitprobabilistic ranking and textitcontextual ranking approaches. Probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher LLM. On the other hand, learning with contextual ranking allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs.
arXiv Detail & Related papers (2023-10-20T09:55:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.