Refined Direct Preference Optimization with Synthetic Data for
Behavioral Alignment of LLMs
- URL: http://arxiv.org/abs/2402.08005v1
- Date: Mon, 12 Feb 2024 19:10:13 GMT
- Title: Refined Direct Preference Optimization with Synthetic Data for
Behavioral Alignment of LLMs
- Authors: Víctor Gallego
- Abstract summary: We introduce refined Direct Preference Optimization (rDPO), a method for improving the behavioral alignment of Large Language Models (LLMs) without the need for human-annotated data.
The method involves creating synthetic data using self-critique prompting by a teacher LLM and then utilising a generalized DPO loss function to distil to a student LLM.
The loss function incorporates an additional external reward model to improve the quality of synthetic data, making rDPO robust to potential noise in the synthetic dataset.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce refined Direct Preference Optimization
(rDPO), a method for improving the behavioral alignment of Large Language
Models (LLMs) without the need for human-annotated data. The method involves
creating synthetic data using self-critique prompting by a teacher LLM and then
utilising a generalized DPO loss function to distil to a student LLM. The loss
function incorporates an additional external reward model to improve the
quality of synthetic data, making rDPO robust to potential noise in the
synthetic dataset. rDPO is shown to be effective in a diverse set of
behavioural alignment tasks, such as improved safety, robustness against
role-playing, and reduced sycophancy. Code to be released at
https://github.com/vicgalle/refined-dpo.
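The abstract does not spell out the exact form of the generalized DPO loss, but the core idea, weighting each synthetic preference pair by an external reward model so that noisy pairs contribute less, can be illustrated with a minimal sketch. Everything below (the function name rdpo_loss, the sigmoid weighting of the reward margin, the default beta) is an illustrative assumption, not the released implementation; see the repository linked above for the actual code.

```python
# Hypothetical sketch of a reward-weighted DPO loss (not the paper's exact formulation).
# Inputs are per-pair summed log-probabilities from the student (policy) and a frozen
# reference model, plus scores from an external reward model for the synthetic pairs.
import torch
import torch.nn.functional as F

def rdpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              rm_chosen_scores, rm_rejected_scores,
              beta=0.1):
    """Standard DPO term, modulated by an external reward model's margin."""
    # Standard DPO logits: beta * (policy log-ratio minus reference log-ratio).
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    dpo_logits = beta * (pi_logratios - ref_logratios)

    # Assumed refinement: trust each synthetic pair in proportion to how strongly
    # the external reward model agrees that "chosen" beats "rejected".
    reward_gap = rm_chosen_scores - rm_rejected_scores
    pair_weight = torch.sigmoid(reward_gap)  # in (0, 1); noisy or flipped pairs get low weight

    per_pair_loss = -F.logsigmoid(dpo_logits)
    return (pair_weight * per_pair_loss).mean()
```

In the actual method the chosen/rejected pairs come from self-critique prompting by a teacher LLM; the sigmoid weighting above is only one plausible way to make the distillation robust to noise in those pairs.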
Related papers
- Data-adaptive Differentially Private Prompt Synthesis for In-Context Learning [16.04405606517753]
Large Language Models (LLMs) rely on the contextual information embedded in examples/demonstrations to perform in-context learning (ICL).
We introduce a novel data-adaptive differentially private algorithm called AdaDPSyn to generate synthetic examples from a private dataset.
AdaDPSyn adaptively adjusts the noise level in the data synthesis mechanism according to the inherent statistical properties of the data (a generic noise-calibration sketch appears after this list).
arXiv Detail & Related papers (2024-10-15T22:06:30Z)
- Self-Boosting Large Language Models with Synthetic Preference Data [97.94185115047999]
We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment.
After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities.
SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.
arXiv Detail & Related papers (2024-10-09T14:57:31Z)
- Minor DPO reject penalty to increase training robustness [8.971332948872185]
Learning from human preferences is a paradigm used in the fine-tuning step of large-scale language models (LLMs) to better align a pretrained LLM with human preferences for downstream tasks.
Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method.
In this article, we analyze the working mechanism of β in DPO, highlight how its role differs between the underlying RL algorithm and DPO, and examine the potential shortcomings introduced by the DPO simplification (the standard DPO loss in which β appears is sketched after this list).
arXiv Detail & Related papers (2024-08-19T09:29:31Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- P-TA: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models [15.969452637480167]
We propose using proximal policy optimization (PPO) within a Generative Adversarial Network (GAN) framework to guide LLM-based tabular data augmentation.
PPO leads to an approximately 4% improvement in the accuracy of models trained on synthetically generated data over state-of-the-art datasets.
arXiv Detail & Related papers (2024-06-17T10:22:00Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss (a schematic form of this combined objective is sketched after this list).
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself (a rough sketch of one self-play round appears after this list).
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game [31.66896160733569]
We propose an Adversarial Preference Optimization (APO) framework to target more efficient human preference optimization.
We find the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness.
arXiv Detail & Related papers (2023-11-14T10:10:31Z)
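For the AdaDPSyn entry above: its summary only states that the noise level is adapted to the statistical properties of the data. As a point of reference, the snippet below sketches a generic (ε, δ)-differentially-private Gaussian mechanism with a data-dependent clipping radius; the function name and the adaptation rule are assumptions for illustration, not the paper's algorithm.

```python
# Generic Gaussian mechanism with a data-dependent clipping radius, as a stand-in
# for what "adaptively adjusting the noise level" can mean in DP data synthesis.
import numpy as np

def dp_noisy_mean(embeddings, epsilon, delta, clip_quantile=0.9):
    """Release a noisy mean of per-example embedding vectors (illustrative only)."""
    norms = np.linalg.norm(embeddings, axis=1)
    clip_radius = np.quantile(norms, clip_quantile)   # adapt the clipping scale to the data
    # (A real DP algorithm would also privatize this adaptive choice.)
    clipped = embeddings * np.minimum(1.0, clip_radius / np.maximum(norms, 1e-12))[:, None]

    sensitivity = clip_radius / len(embeddings)       # L2 sensitivity of the clipped mean
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return clipped.mean(axis=0) + np.random.normal(0.0, sigma, size=embeddings.shape[1])
```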
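For the "Minor DPO reject penalty" entry above: the β it analyzes is the coefficient in the standard DPO objective, commonly written as

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
\right]
```

In the underlying KL-regularized RL formulation, β controls how strongly the policy is kept close to the reference model, which is why its behaviour under the DPO simplification matters.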
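For the "Provably Mitigating Overoptimization in RLHF" entry above: a schematic version of an objective combining a preference optimization loss with a supervised (SFT) loss is shown below. The weighting λ and the particular form of each term are placeholders, not the paper's exact formulation.

```latex
\min_\theta \; \mathcal{L}_{\mathrm{pref}}(\theta) + \lambda\, \mathcal{L}_{\mathrm{SFT}}(\theta),
\qquad
\mathcal{L}_{\mathrm{SFT}}(\theta)
= -\,\mathbb{E}_{(x, y_w) \sim \mathcal{D}} \big[ \log \pi_\theta(y_w \mid x) \big]
```

The SFT term plays the role of the regularizer in the title, discouraging the policy from drifting toward responses that the preference loss alone would over-reward.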
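For the SPIN entry above: at a high level, one self-play round pairs a ground-truth (e.g. SFT) response as "chosen" against the current model's own generation as "rejected" and then applies a DPO-style update. The helpers generate and dpo_update below are hypothetical placeholders passed in by the caller, not a real API.

```python
# Illustrative single round of a SPIN-style self-play fine-tuning loop.
def spin_iteration(model, ref_model, sft_dataset, generate, dpo_update, num_epochs=1):
    """One self-play round; `generate` and `dpo_update` are caller-supplied callables."""
    # Build synthetic preference pairs: human response = chosen, model's own output = rejected.
    pairs = []
    for prompt, human_response in sft_dataset:
        model_response = generate(model, prompt)          # current model plays the "opponent"
        pairs.append((prompt, human_response, model_response))

    # DPO-style updates push the model toward the human data and away from its previous self,
    # measured relative to the frozen reference model.
    for _ in range(num_epochs):
        for prompt, chosen, rejected in pairs:
            dpo_update(model, ref_model, prompt, chosen, rejected)

    return model  # typically becomes the opponent (and new reference) in the next round
```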
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.