Aligning Large Language Models with Self-generated Preference Data
- URL: http://arxiv.org/abs/2406.04412v1
- Date: Thu, 6 Jun 2024 18:01:02 GMT
- Title: Aligning Large Language Models with Self-generated Preference Data
- Authors: Dongyoung Kim, Kimin Lee, Jinwoo Shin, Jaehyung Kim,
- Abstract summary: We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
- Score: 72.99676237703099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning large language models (LLMs) with human preferences becomes a key component to obtaining state-of-the-art performance, but it yields a huge cost to construct a large human-annotated preference dataset. To tackle this problem, we propose a new framework that boosts the alignment of LLMs through Self-generated Preference data (Selfie) using only a very small amount of human-annotated preference data. Our key idea is leveraging the human prior knowledge within the small (seed) data and progressively improving the alignment of LLM, by iteratively generating the responses and learning from them with the self-annotated preference data. To be specific, we propose to derive the preference label from the logits of LLM to explicitly extract the model's inherent preference. Compared to the previous approaches using external reward models or implicit in-context learning, we observe that the proposed approach is significantly more effective. In addition, we introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data. Our experimental results demonstrate that the proposed framework significantly boosts the alignment of LLMs. For example, we achieve superior alignment performance on AlpacaEval 2.0 with only 3.3\% of the ground-truth preference labels in the Ultrafeedback data compared to the cases using the entire data or state-of-the-art baselines.
Related papers
- Exposing Privacy Gaps: Membership Inference Attack on Preference Data for LLM Alignment [8.028743532294532]
We introduce a novel reference-based attack framework specifically for analyzing preference data called PREMIA.
We provide empirical evidence that DPO models are more vulnerable to MIA compared to PPO models.
arXiv Detail & Related papers (2024-07-08T22:53:23Z) - Aligning Large Language Models via Fine-grained Supervision [20.35000061196631]
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations.
Current approaches focus on using reinforcement learning with human feedback to improve model alignment.
We propose a method to enhance LLM alignment through fine-grained token-level supervision.
arXiv Detail & Related papers (2024-06-04T20:21:45Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [90.4820014819937]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences.
We propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs.
We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z) - CURATRON: Complete Robust Preference Data for Robust Alignment of Large
Language Models [1.7849982327883962]
This paper addresses the challenges of aligning large language models (LLMs) with human values via preference learning (PL)
We propose a novel method for curation robustly and completely recalibrating values within these datasets.
Our algorithms handle adversarial noise and unobserved comparisons well in both general and preference dataset settings.
arXiv Detail & Related papers (2024-03-05T07:58:12Z) - Direct Large Language Model Alignment Through Self-Rewarding Contrastive
Prompt Distillation [47.16091219929373]
We propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs.
Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA)
In the experimental stage, our DLMA method could surpass the textttRLHF method without relying on human-annotated preference data.
arXiv Detail & Related papers (2024-02-19T07:46:40Z) - Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits.
advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more.
We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z) - Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL)
We first prove that a gradient of synthetic samples with respect to a SSL objective in naive bilevel optimization is textitbiased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.