Direct Large Language Model Alignment Through Self-Rewarding Contrastive
Prompt Distillation
- URL: http://arxiv.org/abs/2402.11907v1
- Date: Mon, 19 Feb 2024 07:46:40 GMT
- Title: Direct Large Language Model Alignment Through Self-Rewarding Contrastive
Prompt Distillation
- Authors: Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong
Shan, Meng Cao, Lijie Wen
- Abstract summary: We propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs.
Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA).
In experiments, our DLMA method surpasses the RLHF method without relying on human-annotated preference data.
- Score: 47.16091219929373
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning large language models (LLMs) with human expectations without
human-annotated preference data is an important problem. In this paper, we
propose a method to evaluate response preference using the output
probabilities of response pairs under contrastive prompt pairs, which achieves
better performance on LLaMA2-7B and LLaMA2-13B than RLAIF. Based
on this, we propose an automatic alignment method, Direct Large Model Alignment
(DLMA). First, we use contrastive prompt pairs to automatically generate
preference data. Then, we continue to evaluate the generated preference data
using contrastive prompt pairs and calculate a self-rewarding score. Finally,
we use the DPO algorithm to effectively align LLMs by combining this
self-rewarding score. In our experiments, the DLMA method surpasses the RLHF
method without relying on human-annotated preference data.
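
To make the pipeline above concrete, here is a minimal sketch (not the authors' released code) of the two quantities the abstract describes: a self-rewarding score computed from the output probabilities of a response pair under a contrastive prompt pair, and a DPO-style loss that folds that score in as a margin. The helper names, the margin scaling, and the beta value are illustrative assumptions; the paper's exact formulation may differ.

```python
# Hedged sketch of DLMA-style self-rewarding scoring plus a margin DPO loss.
# All helper names and constants are illustrative, not the paper's code.
import torch
import torch.nn.functional as F


def self_reward(logp_pos_a, logp_neg_a, logp_pos_b, logp_neg_b):
    """Score response A relative to response B.

    logp_pos_x / logp_neg_x are sequence log-probabilities of response x
    under the "positive" and "negative" contrastive prompts. A response that
    is favored under the positive prompt but not the negative one scores higher.
    """
    margin_a = logp_pos_a - logp_neg_a
    margin_b = logp_pos_b - logp_neg_b
    return margin_a - margin_b


def dpo_loss_with_self_reward(policy_chosen, policy_rejected,
                              ref_chosen, ref_rejected,
                              self_reward_score, beta=0.1):
    """DPO-style loss where the self-rewarding score acts as a margin.

    policy_* / ref_* are sequence log-probabilities of the chosen and
    rejected responses under the policy and a frozen reference model.
    """
    logits = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    # A larger self-reward demands a larger gap between chosen and rejected.
    return -F.logsigmoid(logits - self_reward_score).mean()


if __name__ == "__main__":
    # Toy numbers standing in for real sequence log-probabilities;
    # response A plays the role of the chosen response, B the rejected one.
    score = self_reward(torch.tensor(-12.0), torch.tensor(-15.0),
                        torch.tensor(-14.0), torch.tensor(-13.5))
    loss = dpo_loss_with_self_reward(
        policy_chosen=torch.tensor([-11.0]), policy_rejected=torch.tensor([-14.0]),
        ref_chosen=torch.tensor([-12.0]), ref_rejected=torch.tensor([-13.0]),
        self_reward_score=0.05 * score, beta=0.1)
    print(f"self-reward score: {score.item():.2f}, loss: {loss.item():.4f}")
```

In practice, the log-probabilities would come from scoring each generated response under the positive and negative contrastive prompts with the model itself, and under the policy and a frozen reference model for the DPO term.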
Related papers
- Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is to leverage the human prior knowledge contained in a small amount of (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low-quality generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Offline Regularised Reinforcement Learning for Large Language Models Alignment [33.483481840098925]
We propose DRO, or Direct Reward Optimisation, as a framework and associated algorithms.
DRO uses a simple mean-squared objective that can be implemented in various ways.
arXiv Detail & Related papers (2024-05-29T14:11:29Z)
- Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences.
We propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs.
We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
- Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization [25.290462963681257]
Multimodal Large Language Models (MLLMs) excel in generating responses based on visual inputs.
However, they often suffer from a bias towards generating responses similar to their pretraining corpus, overshadowing the importance of visual information.
We treat this bias as a "preference" for pretraining statistics, which hinders the model's grounding in visual input.
arXiv Detail & Related papers (2024-03-13T17:29:45Z)
- CURATRON: Complete Robust Preference Data for Robust Alignment of Large Language Models [1.7849982327883962]
This paper addresses the challenges of aligning large language models (LLMs) with human values via preference learning (PL).
We propose a novel curation method for robustly and completely recalibrating values within these datasets.
Our algorithms handle adversarial noise and unobserved comparisons well in both general and preference dataset settings.
arXiv Detail & Related papers (2024-03-05T07:58:12Z)
- Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
- Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in a single inference step.
Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z)