Hummer: Towards Limited Competitive Preference Dataset
- URL: http://arxiv.org/abs/2405.11647v3
- Date: Tue, 6 Aug 2024 14:12:26 GMT
- Title: Hummer: Towards Limited Competitive Preference Dataset
- Authors: Li Jiang, Yusen Wu, Junwu Xiong, Jingqing Ruan, Yichuan Ding, Qingpei Guo, Zujie Wen, Jun Zhou, Xiaotie Deng
- Abstract summary: We introduce a novel metric, Alignment Dimension Conflict, to quantify the degree of conflict within preference datasets.
We present Hummer and its fine-grained variant, Hummer-F, as innovative pairwise preference datasets with reduced-conflict alignment objectives.
- Score: 19.03597445162459
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference datasets are essential for incorporating human preferences into pre-trained language models, playing a key role in the success of Reinforcement Learning from Human Feedback. However, these datasets often exhibit conflicting alignment objectives, leading to increased vulnerability to jailbreak attacks and difficulty in adapting downstream tasks to prioritize specific alignment objectives without negatively impacting others. In this work, we introduce a novel statistical metric, Alignment Dimension Conflict, to quantify the degree of conflict within preference datasets. We then present Hummer and its fine-grained variant, Hummer-F, as innovative pairwise preference datasets with reduced-conflict alignment objectives. Hummer is built on UltraFeedback and enhanced with AI feedback from GPT-4, making it the first preference dataset aimed at reducing competition between alignment objectives. Furthermore, we develop reward models, HummerRM and HummerRM-F, which employ a hybrid sampling approach to balance diverse alignment objectives effectively. This sampling method positions HummerRM as an ideal model for further domain-specific fine-tuning and for reducing vulnerability to attacks.
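The abstract gives neither the formal definition of Alignment Dimension Conflict nor the exact hybrid sampling rule, so the sketch below is only one plausible reading: conflict between two alignment dimensions is measured as the rate at which their per-dimension scores disagree over which response in a pair is preferred, and hybrid sampling mixes per-dimension preference pairs by weight. All function names, the disagreement rule, and the weighting scheme are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def pairwise_conflict(scores_a, scores_b):
    """Disagreement rate between two alignment dimensions.

    scores_a, scores_b: float arrays of shape (n_pairs, 2) holding each
    dimension's scores for the (chosen, rejected) response of every pair.
    """
    pref_a = np.sign(scores_a[:, 0] - scores_a[:, 1])
    pref_b = np.sign(scores_b[:, 0] - scores_b[:, 1])
    return float(np.mean(pref_a != pref_b))

def alignment_dimension_conflict(scores):
    """Average pairwise disagreement over all dimension pairs.

    scores: dict mapping dimension name -> (n_pairs, 2) score array;
    assumes at least two dimensions.
    """
    dims = list(scores)
    pairs = [(a, b) for i, a in enumerate(dims) for b in dims[i + 1:]]
    return sum(pairwise_conflict(scores[a], scores[b]) for a, b in pairs) / len(pairs)

def hybrid_sample(per_dim_pairs, weights, n, seed=0):
    """Draw a batch mixing per-dimension preference pairs by weight,
    roughly in the spirit of HummerRM's balanced sampling."""
    rng = np.random.default_rng(seed)
    dims = list(per_dim_pairs)
    w = np.array([weights[d] for d in dims], dtype=float)
    picks = rng.choice(len(dims), size=n, p=w / w.sum())
    return [per_dim_pairs[dims[i]][int(rng.integers(len(per_dim_pairs[dims[i]])))]
            for i in picks]
```

Under this reading, a low ADC means the dimensions rarely pull a pair's preference label in opposite directions, which is the property Hummer and Hummer-F are curated for.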
Related papers
- SeRA: Self-Reviewing and Alignment of Large Language Models using Implicit Reward Margins [30.767203592231496]
Self-Reviewing and Alignment (SeRA) is a cost-efficient and effective method that can be readily combined with existing direct alignment algorithms (DAAs).
SeRA comprises two components: (1) sample selection using implicit reward margins, which helps alleviate over-fitting to undesired features, and (2) preference bootstrapping, which uses implicit rewards to augment preference data with updated policy models.
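The margin SeRA selects on can be illustrated with the DPO-style implicit reward r(x, y) = β·log(π_θ(y|x)/π_ref(y|x)). A minimal sketch, assuming summed per-token log-probabilities are already available; β, the threshold, and the direction of the cutoff are illustrative placeholders, not SeRA's exact rule.

```python
import torch

def implicit_reward_margin(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """DPO-style implicit reward margin between the chosen (w) and
    rejected (l) responses: beta * log(pi_theta / pi_ref) for each side."""
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    return reward_w - reward_l

def margin_mask(margins: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    # Boolean mask over a batch of pairs; which side of the cutoff to keep
    # is the method's design choice, so treat this rule as a placeholder.
    return margins.abs() < threshold
```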
arXiv Detail & Related papers (2024-10-12T04:17:28Z) - Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization [64.34767799614328]
Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities.
We propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities.
arXiv Detail & Related papers (2024-09-26T04:41:08Z) - SEAL: Systematic Error Analysis for Value ALignment [4.2185937778110825]
Reinforcement Learning from Human Feedback aims to align language models with human values.
Despite its importance, the internal mechanisms of RLHF remain poorly understood.
This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values.
arXiv Detail & Related papers (2024-08-16T18:48:30Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
We therefore increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z) - Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences and values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
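One way to make the "explicitly specifies preference scores" idea concrete is to prefix the instruction with per-objective control markers the model is trained to condition on. The format below is a hypothetical illustration, not CPO's published token scheme.

```python
def controlled_prompt(instruction: str, scores: dict) -> str:
    """Prefix an instruction with per-objective preference scores
    (hypothetical control-marker format)."""
    control = " ".join(f"<{objective}: {score}>" for objective, score in scores.items())
    return f"{control} {instruction}"

# e.g. controlled_prompt("Summarize this contract.",
#                        {"Helpfulness": 5, "Honesty": 5, "Harmlessness": 4})
```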
arXiv Detail & Related papers (2024-02-29T12:12:30Z) - InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling [66.3072381478251]
Reward hacking, also termed reward overoptimization, remains a critical challenge.
We propose a framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective.
We show that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets.
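The abstract names a variational information bottleneck objective but not its exact form; a common way to instantiate one on top of a reward model is to sample a latent from an encoder, score rewards from the latent, and penalize the KL to a standard normal prior. A minimal sketch under those assumptions; the layer sizes, prior, and loss weight are illustrative, not InfoRM's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBRewardHead(nn.Module):
    """Reward head with a variational information bottleneck (sketch)."""

    def __init__(self, hidden_dim: int, latent_dim: int = 64):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.reward = nn.Linear(latent_dim, 1)

    def forward(self, h):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        # KL(q(z|h) || N(0, I)) limits how much reward-irrelevant detail the
        # latent can carry -- the bottleneck meant to curb reward hacking.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return self.reward(z).squeeze(-1), kl

def ib_ranking_loss(r_chosen, r_rejected, kl, beta=0.01):
    # Bradley-Terry pairwise loss plus the bottleneck penalty.
    return -F.logsigmoid(r_chosen - r_rejected).mean() + beta * kl
```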
arXiv Detail & Related papers (2024-02-14T17:49:07Z) - Uncertainty-Guided Alignment for Unsupervised Domain Adaptation in Regression [5.939858158928473]
Unsupervised Domain Adaptation for Regression aims to adapt a model from a labeled source domain to an unlabeled target domain for regression tasks.
Recent successful works in UDAR mostly focus on subspace alignment, involving the alignment of a selected subspace within the entire feature space.
We propose an effective method for UDAR by incorporating guidance from uncertainty.
arXiv Detail & Related papers (2024-01-24T14:55:02Z) - MAPS: A Noise-Robust Progressive Learning Approach for Source-Free Domain Adaptive Keypoint Detection [76.97324120775475]
Existing cross-domain keypoint detection methods typically require access to the source data during adaptation.
This paper considers source-free domain adaptive keypoint detection, where only the well-trained source model is provided to the target domain.
arXiv Detail & Related papers (2023-02-09T12:06:08Z) - Reproducibility-Oriented and Privacy-Preserving Genomic Dataset Sharing [8.959228247984337]
We propose an innovative method that involves a differential privacy-based scheme for sharing genomic datasets.
We show that our proposed scheme outperforms all other methods in detecting GWAS outcome errors, achieves better utility, and provides higher privacy protection against membership inference attacks (MIAs).
By using our method, genomic researchers will be inclined to share a differentially private yet high-quality version of their datasets.
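The paper's scheme is considerably more involved than this, but the differential-privacy primitive such schemes build on can be sketched with the Laplace mechanism: add noise calibrated to the query's sensitivity before release. The function name, the statistic, and the sensitivity bound below are illustrative assumptions.

```python
import numpy as np

def dp_allele_frequencies(counts: np.ndarray, n_individuals: int,
                          epsilon: float, seed: int = 0) -> np.ndarray:
    """Release per-site allele frequencies with epsilon-DP (sketch).

    Each individual contributes at most 2 alleles per site, so the L1
    sensitivity of a per-site allele count is 2.
    """
    rng = np.random.default_rng(seed)
    sensitivity = 2.0
    noisy = counts + rng.laplace(0.0, sensitivity / epsilon, size=counts.shape)
    return np.clip(noisy, 0.0, 2.0 * n_individuals) / (2.0 * n_individuals)
```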
arXiv Detail & Related papers (2022-09-13T22:20:41Z)