Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation
- URL: http://arxiv.org/abs/2402.11907v2
- Date: Thu, 15 Aug 2024 17:37:36 GMT
- Title: Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation
- Authors: Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong Shan, Meng Cao, Lijie Wen
- Abstract summary: We propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs.
Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA).
In the experimental stage, our DLMA method could surpass the RLHF method without relying on human-annotated preference data.
- Score: 45.21355506181213
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning large language models (LLMs) with human expectations without human-annotated preference data is an important problem. In this paper, we propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs, which could achieve better performance on LLaMA2-7B and LLaMA2-13B compared to RLAIF. Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA). First, we use contrastive prompt pairs to automatically generate preference data. Then, we continue to evaluate the generated preference data using contrastive prompt pairs and calculate a self-rewarding score. Finally, we use the DPO algorithm to effectively align LLMs by combining this self-rewarding score. In the experimental stage, our DLMA method could surpass the RLHF method without relying on human-annotated preference data.
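A minimal sketch of the core scoring step, under the assumption that the self-rewarding score is the gap between each response's log-probability under a "positive" and a "negative" contrastive prompt, taken as a difference between the two responses; the prompt texts, model name, and helper functions below are illustrative placeholders rather than the paper's exact setup.

```python
# Hedged sketch of a contrastive-prompt self-rewarding score (all names illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def response_logprob(prompt: str, response: str) -> float:
    """Sum of token log-probabilities of `response` conditioned on `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits[:, :-1, :]        # position t predicts token t+1
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    resp_start = prompt_ids.shape[1] - 1               # first predicted response token
    return token_logps[:, resp_start:].sum().item()

POS = "You are a helpful and harmless assistant.\n"    # illustrative contrastive prompts
NEG = "You are an unhelpful and harmful assistant.\n"

def self_reward(question: str, y1: str, y2: str) -> float:
    """Positive score -> y1 is preferred over y2 under the contrastive prompt pair."""
    gap1 = response_logprob(POS + question, y1) - response_logprob(NEG + question, y1)
    gap2 = response_logprob(POS + question, y2) - response_logprob(NEG + question, y2)
    return gap1 - gap2
```

Per the abstract, this score is then combined with the DPO objective on the automatically generated preference pairs; that combination is not sketched here.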
Related papers
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality.
We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations.
We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
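The abstract does not spell out the predictor, so the sketch below assumes a simple regressor that maps a description of a human/LM annotation mixture (here just the fraction routed to humans) to the reward-model accuracy it produced; all features and numbers are illustrative.

```python
# Hypothetical sketch of routing via a performance-prediction model; the single
# "human fraction" feature and every number are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Observed (human-annotation fraction -> reward-model accuracy) pairs from pilot runs.
human_frac = np.array([[0.0], [0.25], [0.5], [0.75], [1.0]])
rm_accuracy = np.array([0.71, 0.74, 0.76, 0.755, 0.75])

# A quadratic fit can express "some human data helps, with diminishing returns".
featurize = PolynomialFeatures(degree=2)
predictor = LinearRegression().fit(featurize.fit_transform(human_frac), rm_accuracy)

# Route the annotation budget to the mixture with the highest predicted accuracy.
grid = np.linspace(0, 1, 101).reshape(-1, 1)
best_fraction = grid[np.argmax(predictor.predict(featurize.transform(grid)))][0]
print(f"route roughly {best_fraction:.0%} of instances to human annotators")
```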
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
- REAL: Response Embedding-based Alignment for LLMs [1.9513983244114355]
We propose a strategy for sampling a high-quality training dataset that focuses on acquiring the most informative response pairs.
Experimental results indicate that choosing dissimilar response pairs enhances the direct alignment of LLMs.
Our findings suggest that focusing on less similar pairs can improve the efficiency of LLM alignment, saving up to 65% of annotators' work.
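A minimal sketch of picking the least similar response pair from sentence embeddings, in the spirit of "dissimilar pairs are more informative"; the encoder and candidate texts are assumptions for illustration, not the paper's pipeline.

```python
# Select the most dissimilar response pair by cosine similarity of their embeddings.
import itertools
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def most_dissimilar_pair(responses: list[str]) -> tuple[int, int]:
    """Return indices of the two responses with the lowest cosine similarity."""
    emb = embedder.encode(responses, normalize_embeddings=True)  # unit-norm vectors
    sims = emb @ emb.T                                           # cosine similarity matrix
    pairs = itertools.combinations(range(len(responses)), 2)
    return min(pairs, key=lambda ij: sims[ij[0], ij[1]])

candidates = ["Sure, here is a detailed step-by-step answer ...",
              "I cannot help with that request.",
              "Here is a short answer ..."]
i, j = most_dissimilar_pair(candidates)  # this pair would be sent for preference labelling
```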
arXiv Detail & Related papers (2024-09-17T22:40:54Z)
- Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
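The abstract does not detail the algorithm; one standard way to make preference learning robust to noisy labels is to label-smooth the DPO loss, sketched below purely as an assumption (the smoothing factor eps is illustrative).

```python
# Hedged sketch: label-smoothed ("noise-aware") DPO loss that discounts possibly
# mislabeled preference pairs. Not the paper's exact algorithm.
import torch.nn.functional as F

def noise_aware_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                         ref_chosen_logp, ref_rejected_logp,
                         beta: float = 0.1, eps: float = 0.1):
    """Inputs are summed response log-probs, each a tensor of shape [batch]."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # With probability eps the preference label is assumed flipped, so mix both directions.
    loss = -(1 - eps) * F.logsigmoid(logits) - eps * F.logsigmoid(-logits)
    return loss.mean()
```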
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Offline Regularised Reinforcement Learning for Large Language Models Alignment [33.483481840098925]
We propose DRO, or Direct Reward Optimisation, as a framework and associated algorithms.
DRO uses a simple mean-squared objective that can be implemented in various ways.
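A hedged sketch of one way such a mean-squared objective can be written, assuming it regresses the policy's scaled log-ratio against an observed reward minus a learned value baseline; the exact parameterisation is an assumption based on the abstract.

```python
# Hedged sketch of a DRO-style mean-squared alignment objective (parameterisation assumed).
def dro_style_loss(policy_logp, ref_logp, reward, value, beta: float = 1.0):
    """
    policy_logp, ref_logp: summed log-probs of the sampled response, shape [batch]
    reward: reward observed for that response, shape [batch]
    value: learned baseline V(x) for the prompt, shape [batch]
    """
    residual = reward - value - beta * (policy_logp - ref_logp)
    return 0.5 * (residual ** 2).mean()
```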
arXiv Detail & Related papers (2024-05-29T14:11:29Z)
- Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences.
We propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs.
We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
- Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization [25.290462963681257]
Multimodal Large Language Models (MLLMs) excel in generating responses based on visual inputs.
They often suffer from a bias towards generating responses similar to their pretraining corpus, overshadowing the importance of visual information.
We treat this bias as a "preference" for pretraining statistics, which hinders the model's grounding in visual input.
arXiv Detail & Related papers (2024-03-13T17:29:45Z)
- Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
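A hedged sketch of a relative-preference-style loss that contrasts every preferred response in a batch against every dispreferred one, weighting each pair by prompt-embedding similarity; the weighting scheme is an assumption based on the abstract, not the paper's exact formulation.

```python
# Hedged sketch: batch-wise contrast of preferred vs. dispreferred responses across
# prompts, weighted by prompt similarity (weighting scheme assumed for illustration).
import torch
import torch.nn.functional as F

def rpo_style_loss(chosen_margin, rejected_margin, prompt_emb, beta: float = 0.1):
    """
    chosen_margin[i]   = log pi(y_w_i | x_i) - log pi_ref(y_w_i | x_i), shape [B]
    rejected_margin[j] = log pi(y_l_j | x_j) - log pi_ref(y_l_j | x_j), shape [B]
    prompt_emb: unit-normalised prompt embeddings, shape [B, d]
    """
    weights = torch.softmax(prompt_emb @ prompt_emb.T, dim=-1)   # [B, B] prompt similarity
    logits = beta * (chosen_margin.unsqueeze(1) - rejected_margin.unsqueeze(0))  # [B, B]
    return -(weights * F.logsigmoid(logits)).sum(dim=-1).mean()
```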
arXiv Detail & Related papers (2024-02-12T22:47:57Z)