Related papers: Aligning Large Language Models with Self-generated Preference Data

Aligning Large Language Models with Self-generated Preference Data

URL: http://arxiv.org/abs/2406.04412v1
Date: Thu, 6 Jun 2024 18:01:02 GMT
Title: Aligning Large Language Models with Self-generated Preference Data
Authors: Dongyoung Kim, Kimin Lee, Jinwoo Shin, Jaehyung Kim,
Abstract summary: We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences. Our key idea is leveraging the human prior knowledge within the small (seed) data. We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
Score: 72.99676237703099
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Aligning large language models (LLMs) with human preferences becomes a key component to obtaining state-of-the-art performance, but it yields a huge cost to construct a large human-annotated preference dataset. To tackle this problem, we propose a new framework that boosts the alignment of LLMs through Self-generated Preference data (Selfie) using only a very small amount of human-annotated preference data. Our key idea is leveraging the human prior knowledge within the small (seed) data and progressively improving the alignment of LLM, by iteratively generating the responses and learning from them with the self-annotated preference data. To be specific, we propose to derive the preference label from the logits of LLM to explicitly extract the model's inherent preference. Compared to the previous approaches using external reward models or implicit in-context learning, we observe that the proposed approach is significantly more effective. In addition, we introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data. Our experimental results demonstrate that the proposed framework significantly boosts the alignment of LLMs. For example, we achieve superior alignment performance on AlpacaEval 2.0 with only 3.3\% of the ground-truth preference labels in the Ultrafeedback data compared to the cases using the entire data or state-of-the-art baselines.

Related papers

Intuitionistic Fuzzy Sets for Large Language Model Data Annotation: A Novel Approach to Side-by-Side Preference Labeling [0.0]
This paper introduces a novel framework based on intuitionistic fuzzy sets (IFS) for modeling and aggregating human preferences in large language models (LLMs)<n>Our approach captures not only the degree of preference but also the uncertainty and hesitation inherent in human judgment through membership, non-membership, and hesitation degrees.<n> Experimental validation on multiple datasets demonstrates that our IFS-based approach significantly improves annotation consistency, reduces annotator fatigue, and produces higher-quality preference data.
arXiv Detail & Related papers (2025-05-30T04:20:00Z)
Leveraging Robust Optimization for LLM Alignment under Distribution Shifts [52.983390470606146]
Preference alignment methods are increasingly critical for steering large language models to generate outputs consistent with human values.<n>We propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts.
arXiv Detail & Related papers (2025-04-08T09:14:38Z)
Less is More: Improving LLM Alignment via Preference Data Selection [46.9163802899686]
Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences.<n>We propose a novel margin-maximization principle for dataset curation in DPO training.<n>We show that our approach seamlessly extends to iterative DPO, yielding a roughly 3% improvement with 25% online data.
arXiv Detail & Related papers (2025-02-20T13:45:17Z)
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality. We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations. We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
RosePO: Aligning LLM-based Recommenders with Human Values [38.029251417802044]
We propose a general framework -- Recommendation with smoothing personalized Preference Optimization (RosePO) RosePO better aligns with customized human values during the post-training stage. Evaluation on three real-world datasets demonstrates the effectiveness of our method.
arXiv Detail & Related papers (2024-10-16T12:54:34Z)
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
Geometric-Averaged Preference Optimization for Soft Preference Labels [78.2746007085333]
Many algorithms for aligning LLMs with human preferences assume that human preferences are binary and deterministic. In this work, we introduce the distributional soft preference labels and improve Direct Preference Optimization (DPO) with a weighted geometric average of the LLM output likelihood in the loss function.
arXiv Detail & Related papers (2024-09-10T17:54:28Z)
Exposing Privacy Gaps: Membership Inference Attack on Preference Data for LLM Alignment [8.028743532294532]
We introduce a novel reference-based attack framework specifically for analyzing preference data called PREMIA. We provide empirical evidence that DPO models are more vulnerable to MIA compared to PPO models.
arXiv Detail & Related papers (2024-07-08T22:53:23Z)
Aligning Large Language Models via Fine-grained Supervision [20.35000061196631]
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback to improve model alignment. We propose a method to enhance LLM alignment through fine-grained token-level supervision.
arXiv Detail & Related papers (2024-06-04T20:21:45Z)
Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data. Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences. We propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs. We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation [45.21355506181213]
We propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs. Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA) In the experimental stage, our DLMA method could surpass the textttRLHF method without relying on human-annotated preference data.
arXiv Detail & Related papers (2024-02-19T07:46:40Z)
Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits. advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more. We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.