Negating Negatives: Alignment without Human Positive Samples via
Distributional Dispreference Optimization
- URL: http://arxiv.org/abs/2403.03419v1
- Date: Wed, 6 Mar 2024 03:02:38 GMT
- Title: Negating Negatives: Alignment without Human Positive Samples via
Distributional Dispreference Optimization
- Authors: Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, Ning Gu
- Abstract summary: Large language models (LLMs) have revolutionized the role of AI, yet pose potential risks of propagating unethical content.
This work focuses on achieving alignment using solely human-annotated negative samples.
- Score: 36.66806788879868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have revolutionized the role of AI, yet also
pose potential risks of propagating unethical content. Alignment technologies
have been introduced to steer LLMs towards human preference, gaining increasing
attention. Despite notable breakthroughs in this direction, existing methods
heavily rely on high-quality positive-negative training pairs, suffering from
noisy labels and the marginal distinction between preferred and dispreferred
response data. Given recent LLMs' proficiency in generating helpful responses,
this work pivots towards a new research focus: achieving alignment using solely
human-annotated negative samples, preserving helpfulness while reducing
harmfulness. For this purpose, we propose Distributional Dispreference
Optimization (D$^2$O), which maximizes the discrepancy between the generated
responses and the dispreferred ones to effectively eschew harmful information.
We theoretically demonstrate that D$^2$O is equivalent to learning a
distributional instead of instance-level preference model reflecting human
dispreference against the distribution of negative responses. Besides, D$^2$O
integrates an implicit Jeffrey Divergence regularization to balance the
exploitation and exploration of reference policies and converges to a
non-negative one during training. Extensive experiments demonstrate that our
method achieves comparable generation quality and surpasses the latest
baselines in producing less harmful and more informative responses with better
training stability and faster convergence.
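A minimal sketch of how such a dispreference-only objective could look in code, assuming a DPO-style log-ratio parameterization: self-generated responses from the current policy stand in for positives, and their average log-ratio against a reference model is contrasted with that of the human-annotated dispreferred responses. All names (`d2o_loss`, `policy_logps_gen`, etc.) and the exact form of the loss are illustrative assumptions, not the authors' released implementation; the implicit Jeffrey Divergence regularization mentioned in the abstract is not modeled here.

```python
import torch
import torch.nn.functional as F

def d2o_loss(policy_logps_gen: torch.Tensor,  # log pi_theta(y_gen|x) for K sampled responses, shape [B, K]
             ref_logps_gen: torch.Tensor,     # log pi_ref(y_gen|x), shape [B, K]
             policy_logps_neg: torch.Tensor,  # log pi_theta(y_neg|x) for human-annotated negatives, shape [B]
             ref_logps_neg: torch.Tensor,     # log pi_ref(y_neg|x), shape [B]
             beta: float = 0.1) -> torch.Tensor:
    """Hypothetical dispreference-only loss in the spirit of D^2O (not the paper's exact objective)."""
    # Distribution-level term: average the policy/reference log-ratio over the K self-generated responses.
    gen_ratio = (policy_logps_gen - ref_logps_gen).mean(dim=-1)   # shape [B]
    # Instance-level term for the dispreferred response.
    neg_ratio = policy_logps_neg - ref_logps_neg                  # shape [B]
    # Logistic loss that widens the margin between generated and dispreferred responses,
    # pushing the policy away from the negative distribution without requiring human positives.
    return -F.logsigmoid(beta * (gen_ratio - neg_ratio)).mean()
```

In practice, the sequence log-probabilities would be obtained by summing per-token log-probabilities under the trained policy and a frozen reference model for each prompt.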
Related papers
- Negative-Prompt-driven Alignment for Generative Language Model [34.191590966148816]
We propose NEgative-prompt-driven AlignmenT (NEAT) to guide language models away from undesirable behaviors.
NEAT explicitly penalizes the model for producing harmful outputs, guiding it toward desirable behaviors while steering it away from undesirable, biased responses.
Extensive experiments validate NEAT's effectiveness in significantly enhancing language models' alignment with human values and preferences.
arXiv Detail & Related papers (2024-10-16T03:30:09Z)
- RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold [41.28168368547099]
Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts.
We show that training on per-step negatives can help to unlearn spurious correlations in the positive data.
arXiv Detail & Related papers (2024-06-20T17:45:54Z)
- Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z)
- Generating Negative Samples for Sequential Recommendation [83.60655196391855]
We propose to Generate Negative Samples (items) for Sequential Recommendation (SR).
A negative item is sampled at each time step based on the current SR model's learned user preferences toward items (a toy sketch of this idea appears after this list).
Experiments on four public datasets verify the importance of providing high-quality negative samples for SR.
arXiv Detail & Related papers (2022-08-07T05:44:13Z)
- Negative Sampling for Recommendation [7.758275614033198]
Effectively sampling high-quality negative instances is important for training a good recommendation model.
We argue that a high-quality negative sample should be both informative and unbiased.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Mixture Proportion Estimation and PU Learning: A Modern Approach [47.34499672878859]
Given only positive examples and unlabeled examples, we might hope to estimate an accurate positive-versus-negative classifier.
Classical methods for both problems break down in high-dimensional settings.
We propose two simple techniques: Best Bin Estimation (BBE) and Conditional Value Ignoring Risk (CVIR).
arXiv Detail & Related papers (2021-11-01T14:42:23Z)
- Towards Overcoming False Positives in Visual Relationship Detection [95.15011997876606]
We investigate the cause of the high false positive rate in Visual Relationship Detection (VRD).
This paper presents Spatially-Aware Balanced negative pRoposal sAmpling (SABRA) as a robust VRD framework that alleviates the influence of false positives.
arXiv Detail & Related papers (2020-12-23T06:28:00Z)
- Simplify and Robustify Negative Sampling for Implicit Collaborative Filtering [42.832851785261894]
In this paper, we first provide a novel understanding of negative instances by empirically observing that only a few instances are potentially important for model learning.
We then tackle the untouched false negative problem by favouring high-variance samples stored in memory.
Empirical results on two synthetic datasets and three real-world datasets demonstrate both the robustness and the superiority of our negative sampling method.
arXiv Detail & Related papers (2020-09-07T19:08:26Z)
- NPCFace: Negative-Positive Collaborative Training for Large-scale Face Recognition [78.21084529159577]
We study how to make better use of hard samples for improving the training.
Existing methods overlook the correlation between hard positive and hard negative samples, as well as the relation between the margins in the positive and negative logits.
We propose a novel Negative-Positive Collaboration loss, named NPCFace, which emphasizes the training on both negative and positive hard cases.
arXiv Detail & Related papers (2020-07-20T14:52:29Z)
- Understanding Negative Sampling in Graph Representation Learning [87.35038268508414]
We show that negative sampling is as important as positive sampling in determining the optimization objective and the resulting variance.
We propose MCNS, which approximates the positive distribution with self-contrast approximation and accelerates negative sampling via Metropolis-Hastings.
We evaluate our method on 5 datasets that cover extensive downstream graph learning tasks, including link prediction, node classification and personalized recommendation.
arXiv Detail & Related papers (2020-05-20T06:25:21Z)
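As a toy illustration of the model-aware negative sampling referenced in the sequential-recommendation entry above, the sketch below draws a negative item with probability proportional to the current model's preference scores, so harder negatives are sampled more often. The function name `sample_hard_negative`, the softmax temperature, and the catalog size are assumptions for illustration, not code from any of the listed papers.

```python
import torch

def sample_hard_negative(item_scores: torch.Tensor,
                         positive_item: int,
                         temperature: float = 1.0) -> int:
    """Toy sketch: sample a negative item according to the model's current preference scores."""
    # Turn preference scores into a sampling distribution over the item catalog.
    probs = torch.softmax(item_scores / temperature, dim=-1).clone()
    probs[positive_item] = 0.0            # never sample the ground-truth next item as a negative
    probs = probs / probs.sum()           # renormalize after masking the positive
    # Higher-scored (harder) items are drawn more often.
    return int(torch.multinomial(probs, num_samples=1).item())

# Example usage with random scores over a hypothetical catalog of 1000 items.
scores = torch.randn(1000)
negative_item = sample_hard_negative(scores, positive_item=42)
```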
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.