Related papers: Self-Destructive Language Model

URL: http://arxiv.org/abs/2505.12186v1
Date: Sun, 18 May 2025 01:08:18 GMT
Title: Self-Destructive Language Model
Authors: Yuhui Wang, Rongyi Zhu, Ting Wang
Abstract summary: Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs). We introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts.
Score: 13.808746955144771
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent "trainability" on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this critical limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To enable practical training, we develop an efficient Hessian-free gradient estimate with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates that SEAM creates a no-win situation for adversaries: the self-destructive models achieve state-of-the-art robustness against low-intensity attacks and undergo catastrophic performance collapse under high-intensity attacks, rendering them effectively unusable. (warning: this paper contains potentially harmful content generated by LLMs.)
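The abstract mentions two ingredients: a loss that couples the optimization trajectories of benign and harmful data, and a Hessian-free gradient estimate. The abstract gives no formulas, so the following is only a toy sketch of the general idea and not the SEAM loss itself. It assumes, purely for illustration, that "coupling" is measured as the cosine similarity between the benign and harmful gradients, and that a Hessian-vector product is approximated by a finite difference of gradients. The quadratic losses and the names `grad_quadratic`, `cosine`, and `hvp_fd` are hypothetical.

```python
import math


def grad_quadratic(M, t):
    """Return the gradient function of the toy loss 0.5 * ||M @ theta - t||^2."""
    def grad(theta):
        # Residual r = M @ theta - t.
        r = [sum(m * th for m, th in zip(row, theta)) - ti
             for row, ti in zip(M, t)]
        # Gradient = M^T @ r.
        return [sum(M[i][j] * r[i] for i in range(len(M)))
                for j in range(len(theta))]
    return grad


def cosine(u, v):
    """Cosine similarity between two gradient vectors (assumed coupling measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hvp_fd(grad, theta, v, eps=1e-4):
    """Hessian-free estimate of H @ v: (grad(theta + eps*v) - grad(theta)) / eps."""
    shifted = [th + eps * vi for th, vi in zip(theta, v)]
    g_shifted, g_base = grad(shifted), grad(theta)
    return [(a - b) / eps for a, b in zip(g_shifted, g_base)]


# Toy stand-ins for the benign and harmful objectives (illustrative only).
grad_benign = grad_quadratic([[1.0, 0.0], [0.0, 2.0]], [1.0, 0.0])
grad_harmful = grad_quadratic([[1.0, 1.0], [0.0, 1.0]], [0.0, 1.0])

theta = [0.5, -0.5]

# A coupling term of this kind would be driven toward -1 so that descending
# the harmful loss also ascends the benign loss.
coupling = cosine(grad_benign(theta), grad_harmful(theta))

# Differentiating the coupling term needs Hessian-vector products; the finite
# difference below avoids forming the Hessian explicitly.
hv = hvp_fd(grad_benign, theta, [1.0, 1.0])

print("gradient cosine similarity:", coupling)
print("finite-difference H @ v:", hv)
```

For quadratic losses the finite-difference estimate is exact up to floating-point error, which makes the toy easy to check; for a real LLM loss the estimate carries an approximation error that depends on the step size `eps`.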

Related papers

Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms [3.648393062009244]
Large language models (LLMs) are increasingly trained on massive, heterogeneous text corpora. This raises serious concerns about the unauthorised use of proprietary or personal data during model training. We propose Disclaimer Injection, a novel data-level defence that renders text unlearnable to LLMs.
arXiv Detail & Related papers (2026-01-06T20:34:15Z)
Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler [67.24175911858312]
Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Bayesian Data Scheduler (BDS) is an adaptive tuning-stage defense strategy with no need for attack simulation. BDS learns the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets.
arXiv Detail & Related papers (2025-10-31T04:49:37Z)
Reliable Unlearning Harmful Information in LLMs with Metamorphosis Representation Projection [17.369869625390894]
We propose a Metamorphosis Representation Projection (MRP) approach to machine unlearning. By implementing projective transformations in the hidden state space of specific network layers, our method effectively eliminates harmful information while preserving useful knowledge. Experimental results demonstrate that our approach enables effective continuous unlearning and successfully defends against relearning attacks.
arXiv Detail & Related papers (2025-08-21T11:12:09Z)
Mitigating Data Exfiltration Attacks through Layer-Wise Learning Rate Decay Fine-Tuning [4.613829141527782]
Data lakes enable the training of powerful machine learning models on sensitive, high-value medical datasets. Recent studies show adversaries can exfiltrate training data by embedding latent representations into model parameters. We propose a simple yet effective mitigation strategy that perturbs model parameters at export time through fine-tuning with a decaying layer-wise learning rate.
arXiv Detail & Related papers (2025-08-20T09:05:01Z)
MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models [56.09354775405601]
Model extraction attacks aim to replicate the functionality of a black-box model through query access. Most existing defenses presume that attacker queries have out-of-distribution (OOD) samples, enabling them to detect and disrupt suspicious inputs. We propose MISLEADER, a novel defense strategy that does not rely on OOD assumptions.
arXiv Detail & Related papers (2025-06-03T01:37:09Z)
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning [12.293101110323722]
Fine-tuning-as-a-service exposes models to harmful fine-tuning attacks. We propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse. This collapse directly neutralizes the very general capabilities that attackers exploit.
arXiv Detail & Related papers (2025-05-22T11:47:08Z)
Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond [41.3029262040131]
We investigate how to make unlearned models robust against relearning attacks. Our analysis reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks.
arXiv Detail & Related papers (2025-02-07T23:03:55Z)
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities [49.09703018511403]
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. We propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights.
arXiv Detail & Related papers (2025-02-03T18:59:16Z)
HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment [1.8843687952462742]
This paper aims to address gaps in the current literature on jailbreaking techniques and the evaluation of LLM vulnerabilities. Our contributions include the creation of a novel dataset designed to assess the harmfulness of model outputs across multiple harm levels. We provide a comprehensive benchmark of state-of-the-art jailbreaking attacks, specifically targeting the Vicuna 13B v1.5 model.
arXiv Detail & Related papers (2024-11-11T10:02:49Z)
Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarial attacking various downstream models fine-tuned from the segment anything model (SAM). To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z)
Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses. C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z)
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks. This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections [17.49244337226907]
We show that it is possible to conduct stealthy and persistent unalignment on large language models via backdoor injections. Our proposed stealthy and persistent unalignment can successfully pass the safety evaluation while maintaining strong persistence against re-alignment defense.
arXiv Detail & Related papers (2023-11-15T23:52:05Z)
Isolation and Induction: Training Robust Deep Neural Networks against Model Stealing Attacks [51.51023951695014]
Existing model stealing defenses add deceptive perturbations to the victim's posterior probabilities to mislead the attackers. This paper proposes Isolation and Induction (InI), a novel and effective training framework for model stealing defenses. In contrast to adding perturbations over model predictions that harm the benign accuracy, we train models to produce uninformative outputs against stealing queries.
arXiv Detail & Related papers (2023-08-02T05:54:01Z)
RelaxLoss: Defending Membership Inference Attacks without Losing Utility [68.48117818874155]
We propose a novel training framework based on a relaxed loss with a more achievable learning target. RelaxLoss is applicable to any classification model with added benefits of easy implementation and negligible overhead. Our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs.
arXiv Detail & Related papers (2022-07-12T19:34:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.