SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
- URL: http://arxiv.org/abs/2508.15648v1
- Date: Thu, 21 Aug 2025 15:26:09 GMT
- Title: SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
- Authors: Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang,
- Abstract summary: Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks. This paper explores aligning the model's inherent discrimination and generation capabilities. Our method does not require any additional annotated data or external models during the training phase.
- Score: 59.217270662809696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small number of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.
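The abstract's core mechanism, using the model's own safety discrimination as the reward signal for safer generation, can be sketched as follows. This is a minimal illustration assuming a generic `llm_generate` callable; the judging prompt, label parsing, and reward values are assumptions made for illustration, not the paper's actual implementation (see the linked repository for that).

```python
from typing import Callable

# Hypothetical self-judging prompt; the paper's actual template may differ.
JUDGE_TEMPLATE = (
    "You are a safety reviewer. Given the request and the response below, "
    "answer with a single word: SAFE or HARMFUL.\n\n"
    "Request: {request}\nResponse: {response}\nVerdict:"
)

def self_discrimination_reward(
    llm_generate: Callable[[str], str],  # the same policy model, prompted as a judge
    request: str,
    response: str,
) -> float:
    """Score a (request, response) pair using the model's own safety judgment."""
    verdict = llm_generate(JUDGE_TEMPLATE.format(request=request, response=response))
    # Map the self-judgment to a scalar reward: +1 if judged safe, -1 if judged harmful.
    return -1.0 if "HARMFUL" in verdict.upper() else 1.0

def rollout_rewards(llm_generate: Callable[[str], str], requests: list[str]) -> list[float]:
    """Generate responses and score them with the model's own judgments.

    These rewards can feed any standard RL fine-tuning loop (e.g. PPO), so no
    external reward model or extra annotated data is needed, matching the
    abstract's claim.
    """
    responses = [llm_generate(q) for q in requests]
    return [self_discrimination_reward(llm_generate, q, a)
            for q, a in zip(requests, responses)]
```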
Related papers
- GUARD: Generation-time LLM Unlearning via Adaptive Restriction and Detection [36.38245533018162]
Large Language Models (LLMs) have demonstrated strong capabilities in memorizing vast amounts of knowledge across diverse domains. Existing unlearning efforts typically fine-tune the model with resources such as forget data, retain data, and a calibration model. We propose Generation-time Unlearning via Adaptive Restriction and Detection (GUARD), a framework that enables dynamic unlearning during LLM generation.
arXiv Detail & Related papers (2025-05-19T16:26:58Z) - Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement [48.50995874445193]
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks. We propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability.
arXiv Detail & Related papers (2025-05-17T15:54:52Z) - Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose fine-tuning MLLMs on a small set of benign instruction-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - A Generative Approach to LLM Harmfulness Detection with Special Red Flag Tokens [15.796683630119654]
We propose to train the model to insert a red flag token into its response at any time when harmful content is generated or about to be generated. Our approach offers several advantages: it enables the model to explicitly learn the concept of harmfulness while marginally affecting the generated distribution. It also evaluates each generated answer and provides robustness as good as adversarial training without the need to run attacks during training.
arXiv Detail & Related papers (2025-02-22T21:48:48Z) - Smoothed Embeddings for Robust Language Models [11.97873981355746]
Large language models (LLMs) are vulnerable to jailbreaking attacks that subvert alignment and induce harmful outputs. We propose the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense, which adds random noise to the embedding vectors and performs aggregation during the generation of each output token. Our experiments demonstrate that our approach achieves superior robustness versus utility tradeoffs compared to the baseline defenses.
arXiv Detail & Related papers (2025-01-27T20:57:26Z) - SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models [19.486685336959482]
As large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. We introduce the Self-Evolving Adversarial Safety (SEAS) optimization framework, which enhances security by leveraging data generated by the model itself.
arXiv Detail & Related papers (2024-08-05T16:55:06Z) - Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many of the predictive signals in the data may stem from biases in data acquisition rather than the underlying task.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z) - RelaxLoss: Defending Membership Inference Attacks without Losing Utility [68.48117818874155]
We propose a novel training framework based on a relaxed loss with a more achievable learning target.
RelaxLoss is applicable to any classification model with added benefits of easy implementation and negligible overhead.
Our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs.
arXiv Detail & Related papers (2022-07-12T19:34:47Z) - Training GANs with Stronger Augmentations via Contrastive Discriminator [80.8216679195]
We introduce a contrastive representation learning scheme into the GAN discriminator, coined ContraD.
This "fusion" enables the discriminators to work with much stronger augmentations without increasing their training instability.
Our experimental results show that GANs with ContraD consistently improve FID and IS compared to other recent techniques incorporating data augmentations.
arXiv Detail & Related papers (2021-03-17T16:04:54Z)