Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
- URL: http://arxiv.org/abs/2403.03419v1
- Date: Wed, 6 Mar 2024 03:02:38 GMT
- Title: Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
- Authors: Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, Ning Gu
- Abstract summary: Large language models (LLMs) have revolutionized the role of AI, yet pose potential risks of propagating unethical content.
This work focuses on achieving alignment using solely human-annotated negative samples.
- Score: 36.66806788879868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have revolutionized the role of AI, yet also
pose potential risks of propagating unethical content. Alignment technologies
have been introduced to steer LLMs towards human preference, gaining increasing
attention. Despite notable breakthroughs in this direction, existing methods
heavily rely on high-quality positive-negative training pairs, suffering from
noisy labels and the marginal distinction between preferred and dispreferred
response data. Given recent LLMs' proficiency in generating helpful responses,
this work pivots towards a new research focus: achieving alignment using solely
human-annotated negative samples, preserving helpfulness while reducing
harmfulness. For this purpose, we propose Distributional Dispreference
Optimization (D$^2$O), which maximizes the discrepancy between the generated
responses and the dispreferred ones to effectively eschew harmful information.
We theoretically demonstrate that D$^2$O is equivalent to learning a
distributional instead of instance-level preference model reflecting human
dispreference against the distribution of negative responses. Besides, D$^2$O
integrates an implicit Jeffrey Divergence regularization to balance the
exploitation and exploration of reference policies and converges to a
non-negative one during training. Extensive experiments demonstrate that our
method achieves comparable generation quality and surpasses the latest
baselines in producing less harmful and more informative responses with better
training stability and faster convergence.
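For intuition only, below is a minimal PyTorch sketch of what a dispreference-only objective of this flavor could look like: a DPO-style logistic loss whose "preferred" side is replaced by the average implicit reward of K self-generated responses, so that only human-annotated negative samples are required. This is an illustration assumed from the abstract, not the authors' exact D$^2$O loss; the function name, tensor shapes, and the beta hyperparameter are assumptions.

```python
# Hedged sketch of a dispreference-only, DPO-style loss (assumed from the
# abstract; NOT the authors' exact D^2O objective). The "preferred" term of
# DPO is replaced by an average over K responses sampled from the model
# itself, so only human-annotated dispreferred responses y_l are needed.
import torch
import torch.nn.functional as F

def dispreference_style_loss(
    policy_logps_gen: torch.Tensor,  # (B, K) log pi_theta(y_k | x) for self-generated y_k
    ref_logps_gen: torch.Tensor,     # (B, K) log pi_ref(y_k | x)
    policy_logps_neg: torch.Tensor,  # (B,)   log pi_theta(y_l | x) for dispreferred y_l
    ref_logps_neg: torch.Tensor,     # (B,)   log pi_ref(y_l | x)
    beta: float = 0.1,               # assumed temperature hyperparameter, as in DPO
) -> torch.Tensor:
    # Distributional "positive" side: average implicit reward over K self-generated samples.
    gen_reward = beta * (policy_logps_gen - ref_logps_gen).mean(dim=-1)
    # Instance-level negative side: implicit reward of the annotated dispreferred response.
    neg_reward = beta * (policy_logps_neg - ref_logps_neg)
    # Maximize the gap between generated and dispreferred responses.
    return -F.logsigmoid(gen_reward - neg_reward).mean()
```

In this reading, the K responses are drawn from the policy (or reference model) for each prompt, which is what makes the contrast distributional rather than instance-level. The Jeffrey divergence mentioned in the abstract is the symmetrized KL, J(P, Q) = KL(P||Q) + KL(Q||P); per the abstract its regularizing effect is implicit in the objective and is not modeled explicitly in this sketch.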
Related papers
- Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions [17.485655062129965]
Recent AI agents rely on instruction tuning and reinforcement learning to calibrate the output of large language models (LLMs) with human intentions.
We propose PT-ALIGN, a novel safety self-alignment approach that minimizes human supervision by automatically refining positive and toxic samples.
Experiments on 9 popular open-source LLMs demonstrate the effectiveness of our PT-ALIGN for safety alignment, while maintaining comparable levels of helpfulness and usefulness.
arXiv Detail & Related papers (2025-02-08T09:54:47Z)
- SyNeg: LLM-Driven Synthetic Hard-Negatives for Dense Retrieval [45.971786380884126]
The performance of dense retrieval (DR) is significantly influenced by the quality of negative sampling.
Recent advancements in large language models (LLMs) offer an innovative solution by generating contextually rich and diverse negative samples.
In this work, we present a framework that harnesses LLMs to synthesize high-quality hard negative samples.
arXiv Detail & Related papers (2024-12-23T03:49:00Z)
- Negative-Prompt-driven Alignment for Generative Language Model [34.191590966148816]
We propose NEgative-prompt-driven AlignmenT (NEAT) to guide language models away from undesirable behaviors.
NEAT explicitly penalizes the model for producing harmful outputs, guiding it not only toward desirable behaviors but also away from generating undesirable, biased responses.
Extensive experiments validate NEAT's effectiveness in significantly enhancing language models' alignment with human values and preferences.
arXiv Detail & Related papers (2024-10-16T03:30:09Z)
- Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z)
- Generating Negative Samples for Sequential Recommendation [83.60655196391855]
We propose to Generate Negative Samples (items) for Sequential Recommendation (SR).
A negative item is sampled at each time step based on the current SR model's learned user preferences toward items (a minimal sketch of this kind of score-based sampling appears after this list).
Experiments on four public datasets verify the importance of providing high-quality negative samples for SR.
arXiv Detail & Related papers (2022-08-07T05:44:13Z)
- Negative Sampling for Recommendation [7.758275614033198]
Effectively sampling high-quality negative instances is important for training a recommendation model well.
We argue that a high-quality negative should be both informative and unbiased.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Mixture Proportion Estimation and PU Learning: A Modern Approach [47.34499672878859]
Given only positive examples and unlabeled examples, we might hope to estimate an accurate positive-versus-negative classifier.
Classical methods for both problems break down in high-dimensional settings.
We propose two simple techniques: Best Bin Estimation (BBE) and Conditional Value Ignoring Risk (CVIR).
arXiv Detail & Related papers (2021-11-01T14:42:23Z)
- Towards Overcoming False Positives in Visual Relationship Detection [95.15011997876606]
We investigate the cause of the high false positive rate in Visual Relationship Detection (VRD).
This paper presents Spatially-Aware Balanced negative pRoposal sAmpling (SABRA) as a robust VRD framework that alleviates the influence of false positives.
arXiv Detail & Related papers (2020-12-23T06:28:00Z)
- NPCFace: Negative-Positive Collaborative Training for Large-scale Face Recognition [78.21084529159577]
We study how to make better use of hard samples to improve training.
The correlation between hard positive and hard negative samples is overlooked, as is the relation between the margins in positive and negative logits.
We propose a novel Negative-Positive Collaboration loss, named NPCFace, which emphasizes the training on both negative and positive hard cases.
arXiv Detail & Related papers (2020-07-20T14:52:29Z)
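As referenced in the Generating Negative Samples for Sequential Recommendation entry above, the following is a minimal sketch of model-aware negative sampling: negatives are drawn at each step in proportion to the current model's item scores, so items the model currently ranks highly but the user did not choose become hard negatives. This is an assumed illustration, not the cited paper's implementation; the function name, tensor shapes, and temperature parameter are assumptions.

```python
# Hedged sketch of model-aware negative sampling for sequential recommendation
# (an assumed illustration, not the cited paper's implementation): negatives
# are drawn in proportion to the current model's scores, excluding the
# ground-truth next item.
import torch

def sample_score_based_negatives(
    item_scores: torch.Tensor,     # (B, num_items) current SR model scores at this step
    positive_items: torch.Tensor,  # (B,) ground-truth next item per user
    num_negatives: int = 1,
    temperature: float = 1.0,      # assumed: lower values concentrate on harder negatives
) -> torch.Tensor:
    probs = torch.softmax(item_scores / temperature, dim=-1)
    # Never sample the true next item as a "negative".
    probs = probs.scatter(1, positive_items.unsqueeze(1), 0.0)
    probs = probs / probs.sum(dim=-1, keepdim=True)
    # Draw num_negatives item indices per user, weighted by model preference.
    return torch.multinomial(probs, num_negatives)
```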