Related papers: Preference Ranking Optimization for Human Alignment

Preference Ranking Optimization for Human Alignment

URL: http://arxiv.org/abs/2306.17492v2
Date: Tue, 27 Feb 2024 18:42:42 GMT
Title: Preference Ranking Optimization for Human Alignment
Authors: Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li and Houfeng Wang
Abstract summary: Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. We propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to fine-tune LLMs for human alignment.
Score: 90.6952059194946
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. However, it encompasses two main drawbacks: (1) RLHF exhibits complexity, instability, and sensitivity to hyperparameters in contrast to SFT. (2) Despite massive trial-and-error, multiple sampling is reduced to pair-wise contrast, thus lacking contrasts from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to directly fine-tune LLMs for human alignment. PRO extends the pair-wise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the rest responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms baseline algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations.

Related papers

A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 instruction unique-following prompts. With our synthetic prompts, we use two preference dataset curation methods - rejection sampling (RS) and Monte Carlo Tree Search (MCTS) Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements. High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z)
Constrain Alignment with Sparse Autoencoders [45.131670081186]
Feature-level constrained Preference Optimization is a novel method designed to simplify the alignment process while ensuring stability.<n>Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence.
arXiv Detail & Related papers (2024-11-12T07:54:13Z)
Permutative Preference Alignment from Listwise Ranking of Human Judgments [40.23480751285947]
We develop an end-to-end alignment algorithm by approximating NDCG with a differentiable surrogate loss.<n>We show that NDCG-based approaches improve ranking accuracy more effectively than B-T-based methods.
arXiv Detail & Related papers (2024-10-06T03:49:28Z)
Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
REAL: Response Embedding-based Alignment for LLMs [1.9513983244114355]
We propose a strategy for sampling a high-quality training dataset that focuses on acquiring the most informative response pairs. Experimental results indicate that choosing dissimilar response pairs enhances the direct alignment of LLMs. Our findings suggest that focusing on less similar pairs can improve the efficiency of LLM alignment, saving up to 65% of annotators' work.
arXiv Detail & Related papers (2024-09-17T22:40:54Z)
Reward Difference Optimization For Sample Reweighting In Offline RLHF [18.62836654699957]
Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of how much one is preferred over the others. We propose a simple yet effective solution called Reward Difference Optimization, shorted as RDO. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation.
arXiv Detail & Related papers (2024-08-18T07:04:16Z)
Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values. We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO) Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
LIRE: listwise reward enhancement for preference alignment [27.50204023448716]
We propose a gradient-based reward optimization approach that incorporates the offline rewards of multiple responses into a streamlined listwise framework. LIRE is straightforward to implement, requiring minimal parameter tuning, and seamlessly aligns with the pairwise paradigm. Our experiments demonstrate that LIRE consistently outperforms existing methods across several benchmarks on dialogue and summarization tasks.
arXiv Detail & Related papers (2024-05-22T10:21:50Z)
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences. We propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs. We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model [3.300814846990438]
Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. As they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that are not aligned with human values. This paper studies two main approaches to LLM alignment: Reinforcement Learning with Human Feedback (RLHF) and contrastive learning-based methods like Direct Preference Optimization (DPO) By analyzing the stability and robustness of RLHF and DPO, we propose MPO, a novel method that mitigates the weaknesses of both approaches.
arXiv Detail & Related papers (2024-03-28T14:15:10Z)
Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts. RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment [105.34140537748546]
We propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained quality signals that are derived by contrasting good and bad responses. Our approach has made two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones. Secondly, we devise a new loss function can leverage fine-grained quality signals to instruct the learning of LLMs for alignment.
arXiv Detail & Related papers (2023-11-07T15:36:40Z)
Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions. CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.