Preference Ranking Optimization for Human Alignment
- URL: http://arxiv.org/abs/2306.17492v2
- Date: Tue, 27 Feb 2024 18:42:42 GMT
- Title: Preference Ranking Optimization for Human Alignment
- Authors: Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li
and Houfeng Wang
- Abstract summary: Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values.
Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment.
We propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to fine-tune LLMs for human alignment.
- Score: 90.6952059194946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) often contain misleading content, emphasizing
the need to align them with human values to ensure secure AI systems.
Reinforcement learning from human feedback (RLHF) has been employed to achieve
this alignment. However, it encompasses two main drawbacks: (1) RLHF exhibits
complexity, instability, and sensitivity to hyperparameters in contrast to SFT.
(2) Despite massive trial-and-error, multiple sampling is reduced to pair-wise
contrast, thus lacking contrasts from a macro perspective. In this paper, we
propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to
directly fine-tune LLMs for human alignment. PRO extends the pair-wise contrast
to accommodate preference rankings of any length. By iteratively contrasting
candidates, PRO instructs the LLM to prioritize the best response while
progressively ranking the rest responses. In this manner, PRO effectively
transforms human alignment into aligning the probability ranking of n responses
generated by LLM with the preference ranking of humans towards these responses.
Experiments have shown that PRO outperforms baseline algorithms, achieving
comparable results to ChatGPT and human responses through automatic-based,
reward-based, GPT-4, and human evaluations.
Related papers
- Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO)
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - LIRE: listwise reward enhancement for preference alignment [27.50204023448716]
We propose a gradient-based reward optimization approach that incorporates the offline rewards of multiple responses into a streamlined listwise framework.
LIRE is straightforward to implement, requiring minimal parameter tuning, and seamlessly aligns with the pairwise paradigm.
Our experiments demonstrate that LIRE consistently outperforms existing methods across several benchmarks on dialogue and summarization tasks.
arXiv Detail & Related papers (2024-05-22T10:21:50Z) - Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model [3.300814846990438]
Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language.
As they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that are not aligned with human values.
This paper studies two main approaches to LLM alignment: Reinforcement Learning with Human Feedback (RLHF) and contrastive learning-based methods like Direct Preference Optimization (DPO)
By analyzing the stability and robustness of RLHF and DPO, we propose MPO, a novel method that mitigates the weaknesses of both approaches.
arXiv Detail & Related papers (2024-03-28T14:15:10Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z) - Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment [105.34140537748546]
We propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained quality signals that are derived by contrasting good and bad responses.
Our approach has made two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones.
Secondly, we devise a new loss function can leverage fine-grained quality signals to instruct the learning of LLMs for alignment.
arXiv Detail & Related papers (2023-11-07T15:36:40Z) - Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z) - RLAIF: Scaling Reinforcement Learning from Human Feedback with AI
Feedback [5.469395454378616]
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences.
RL from AI Feedback (RLAIF) offers a promising alternative that leverages a powerful off-the-shelf LLM to generate preferences in lieu of human annotators.
Our results suggest that RLAIF can achieve human-level performance, offering a potential solution to the scalability limitations of RLHF.
arXiv Detail & Related papers (2023-09-01T05:53:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.