Negative-Prompt-driven Alignment for Generative Language Model
- URL: http://arxiv.org/abs/2410.12194v1
- Date: Wed, 16 Oct 2024 03:30:09 GMT
- Title: Negative-Prompt-driven Alignment for Generative Language Model
- Authors: Shiqi Qiao, Ning Xv, Biao Liu, Xin Geng
- Abstract summary: We propose NEgative-prompt-driven AlignmenT to guide language models away from undesirable behaviors.
NEAT explicitly penalizes the model for producing harmful outputs, guiding it not only toward desirable behaviors but also steering it away from generating undesirable, biased responses.
Extensive experiments validate NEAT's effectiveness in significantly enhancing language models' alignment with human values and preferences.
- Score: 34.191590966148816
- License:
- Abstract: Large language models have achieved remarkable capabilities, but aligning their outputs with human values and preferences remains a significant challenge. Existing alignment methods primarily focus on positive examples while overlooking the importance of negative responses in guiding models away from undesirable behaviors. For instance, widely used alignment datasets reveal a scarcity of explicit negative examples that contradict human values, hindering their ability to discourage harmful or biased outputs during training. To address this limitation, we propose NEAT, i.e., NEgative-prompt-driven AlignmenT, which introduces negative prompts to generate undesirable responses alongside positive examples during the optimization process. NEAT explicitly penalizes the model for producing harmful outputs, guiding it not only toward desirable behaviors but also steering it away from generating undesirable, biased responses. This dual feedback mechanism enables better alignment with human preferences, which is crucial in contexts where avoiding harm is paramount. Starting from a pre-trained language model, NEAT performs online alignment by incorporating a ranking loss derived from an expanded preference dataset containing both positive and negative examples. Extensive experiments validate NEAT's effectiveness in significantly enhancing language models' alignment with human values and preferences.
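The abstract describes NEAT's optimization as a ranking loss over an expanded preference dataset containing both positive examples and negative-prompt-generated responses, but it does not give the exact formulation. The snippet below is only a minimal sketch of that general idea, assuming a hinge-style pairwise ranking objective; the names `sequence_logprob` and `neat_ranking_loss`, the hinge form, and the `margin` default are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a pairwise ranking loss that scores a
# preferred response above a response generated under a negative prompt.
import torch
import torch.nn.functional as F


def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of `labels` under `logits`.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len).
    Padding masks are omitted here for brevity.
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = torch.gather(logp, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    return token_logp.sum(dim=-1)


def neat_ranking_loss(pos_logits, pos_labels, neg_logits, neg_labels, margin: float = 1.0):
    """Hinge-style ranking loss: the positive (preferred) response should score at
    least `margin` higher than the negative-prompt-generated response."""
    pos_score = sequence_logprob(pos_logits, pos_labels)
    neg_score = sequence_logprob(neg_logits, neg_labels)
    return torch.clamp(margin - (pos_score - neg_score), min=0.0).mean()
```

In use, `pos_logits`/`pos_labels` would come from teacher-forcing a preferred response, while `neg_logits`/`neg_labels` would come from a response the model itself produced under a negative prompt during online alignment, matching the dual positive/negative feedback described in the abstract.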
Related papers
- Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization [64.34767799614328]
Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities.
We propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities.
arXiv Detail & Related papers (2024-09-26T04:41:08Z)
- Alternate Preference Optimization for Unlearning Factual Knowledge in Large Language Models [2.0962367975513496]
Machine unlearning aims to efficiently eliminate the influence of specific training data, known as the forget set, from the model.
Existing unlearning methods rely solely on negative feedback to suppress responses related to the forget set.
We propose a novel approach called Alternate Preference Optimization (AltPO), which combines negative feedback with in-domain positive feedback on the forget set.
arXiv Detail & Related papers (2024-09-20T13:05:07Z)
- Towards Unified Modeling for Positive and Negative Preferences in Sign-Aware Recommendation [13.300975621769396]
We propose a novel Light Signed Graph Convolution Network specifically for Recommendation (LSGRec).
For the negative preferences within high-order heterogeneous interactions, first-order negative preferences are captured by the negative links.
Recommendation results are generated based on positive preferences and optimized with negative ones.
arXiv Detail & Related papers (2024-03-13T05:00:42Z)
- Negating Negatives: Alignment with Human Negative Samples via Distributional Dispreference Optimization [37.8788435790632]
Large language models (LLMs) have revolutionized the role of AI, yet pose potential social risks.
Existing methods rely on high-quality positive-negative training pairs, suffering from noisy positive responses that are barely distinguishable from negative ones.
We propose Distributional Dispreference Optimization (D$^2$O), which maximizes the discrepancy between dispreferred responses and the generated non-negative ones.
arXiv Detail & Related papers (2024-03-06T03:02:38Z)
- Generating Enhanced Negatives for Training Language-Based Object Detectors [86.1914216335631]
We propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data.
Specifically, we use large language models to generate negative text descriptions, and text-to-image diffusion models to generate corresponding negative images.
Our experimental analysis confirms the relevance of the generated negative data, and its use in language-based detectors improves performance on two complex benchmarks.
arXiv Detail & Related papers (2023-12-29T23:04:00Z)
- ULMA: Unified Language Model Alignment with Human Demonstration and Point-wise Preference [16.73260713938154]
A typical alignment procedure consists of supervised fine-tuning and preference learning.
We introduce Point-wise Direct Preference Optimization, a novel preference learning method designed to harness point-wise feedback effectively.
Our work also uncovers a novel connection between supervised fine-tuning and point-wise preference learning, culminating in Unified Language Model Alignment.
arXiv Detail & Related papers (2023-12-05T07:52:12Z)
- Language Model Pre-training on True Negatives [109.73819321246062]
Discriminative pre-trained language models (PLMs) learn to predict original texts from intentionally corrupted ones.
Existing PLMs simply treat all corrupted texts as equally negative, without any examination.
We design enhanced pre-training methods to counteract false negative predictions and encourage pre-training language models on true negatives.
arXiv Detail & Related papers (2022-12-01T12:24:19Z)
- Generating Negative Samples for Sequential Recommendation [83.60655196391855]
We propose to Generate Negative Samples (items) for Sequential Recommendation (SR).
A negative item is sampled at each time step based on the current SR model's learned user preferences toward items.
Experiments on four public datasets verify the importance of providing high-quality negative samples for SR.
arXiv Detail & Related papers (2022-08-07T05:44:13Z)
- A Mutually Reinforced Framework for Pretrained Sentence Embeddings [49.297766436632685]
InfoCSE is a novel framework for learning high-quality sentence embeddings.
It exploits the sentence representation model itself and realizes an iterative self-supervision process.
In other words, the representation learning and data annotation become mutually reinforced, where a strong self-supervision effect can be derived.
arXiv Detail & Related papers (2022-02-28T14:00:16Z)
- Contrastive Learning with Adversarial Perturbations for Conditional Text Generation [49.055659008469284]
We propose a principled method to generate positive and negative samples for contrastive learning of seq2seq models.
Specifically, we generate negative examples by adding small perturbations to the input sequence to minimize its conditional likelihood (a rough sketch follows this entry).
We empirically show that our proposed method significantly improves the generalization of seq2seq models on three text generation tasks.
arXiv Detail & Related papers (2020-12-14T06:20:27Z)
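The entry above does not give implementation details, so the following is only a rough sketch of the described negative-generation step, assuming the perturbation is applied to source embeddings with a single FGSM-style update; the callable `model_logits`, the function name, and the step size `epsilon` are illustrative assumptions, not the paper's actual code.

```python
# Rough sketch (not the paper's code): build a hard negative by perturbing the
# source embeddings one FGSM-style step in the direction that *lowers* the
# conditional likelihood of the gold target sequence.
import torch
import torch.nn.functional as F


def perturbed_negative_embeddings(model_logits, src_embeds, tgt_tokens, epsilon=1e-2):
    """model_logits(src_embeds) must return teacher-forced logits of shape
    (batch, tgt_len, vocab) for the gold targets `tgt_tokens` of shape (batch, tgt_len)."""
    src = src_embeds.detach().clone().requires_grad_(True)
    logits = model_logits(src)  # (batch, tgt_len, vocab)
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt_tokens.reshape(-1))
    nll.backward()
    # Ascending the NLL gradient lowers p(target | input), yielding a negative view.
    return (src + epsilon * src.grad.sign()).detach()
```

The perturbed embeddings could then serve as the "hard negative" view in a contrastive objective alongside the unperturbed input, in line with the summary above.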
This list is automatically generated from the titles and abstracts of the papers on this site.