Safety Alignment via Constrained Knowledge Unlearning
- URL: http://arxiv.org/abs/2505.18588v1
- Date: Sat, 24 May 2025 08:29:50 GMT
- Title: Safety Alignment via Constrained Knowledge Unlearning
- Authors: Zesheng Shi, Yucheng Zhou, Jing Li
- Abstract summary: We propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU). CKU focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall performance.
- Score: 11.225354394106226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall performance, offering a superior balance between safety and utility compared to existing methods. Additionally, our analysis of neuron knowledge sensitivity across various MLP layers provides valuable insights into the mechanics of safety alignment and model knowledge editing.
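The gradient-pruning step described in the abstract lends itself to a short sketch. Below is a minimal PyTorch illustration, assuming a hypothetical per-neuron usefulness score for one MLP layer; the scoring source, keep ratio, and learning rate are placeholder assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

def useful_neuron_mask(scores: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """Boolean mask over MLP hidden units: True marks the subset U of
    neurons with the highest usefulness scores (assumed to come from a
    knowledge-localization pass, as CKU describes)."""
    k = max(1, int(keep_ratio * scores.numel()))
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[torch.topk(scores, k).indices] = True
    return mask

def constrained_unlearning_step(mlp: nn.Linear, harmful_loss: torch.Tensor,
                                mask: torch.Tensor, lr: float = 1e-4) -> None:
    """One unlearning update: ascend on the harmful-data loss, but zero
    (prune) the gradient rows belonging to U so useful knowledge is kept."""
    mlp.zero_grad()
    (-harmful_loss).backward()            # negated loss -> gradient ascent on it
    with torch.no_grad():
        mlp.weight.grad[mask] = 0.0       # protect neurons in U (rows = output units)
        if mlp.bias is not None and mlp.bias.grad is not None:
            mlp.bias.grad[mask] = 0.0
        mlp.weight -= lr * mlp.weight.grad
        if mlp.bias is not None and mlp.bias.grad is not None:
            mlp.bias -= lr * mlp.bias.grad
```

In a full model the same mask-and-update would run per MLP layer; one layer is shown to keep the mechanics visible.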
Related papers
- KUDA: Knowledge Unlearning by Deviating Representation for Large Language Models [26.418820118903852]
Large language models (LLMs) acquire a large amount of knowledge through pre-training on vast and diverse corpora. LLM unlearning is a promising technique to reduce risks associated with sensitive, copyrighted, or harmful content in training data. We propose Knowledge Unlearning by Deviating representAtion (KUDA) to achieve effective unlearning at the knowledge level of LLMs.
arXiv Detail & Related papers (2026-02-22T17:16:49Z)
- Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion [27.526437626781597]
We propose Knowledge Density-Guided Unlearning via Blocks Reinsertion (KUnBR) for large language models. KUnBR identifies layers with rich harmful knowledge and then thoroughly eliminates that knowledge via a re-insertion strategy. Experiments conducted on several unlearning and general capability benchmarks demonstrate that KUnBR achieves state-of-the-art forgetting performance.
arXiv Detail & Related papers (2025-11-11T14:12:43Z)
- CLUE: Conflict-guided Localization for LLM Unlearning Framework [35.90665719234101]
We propose a Conflict-guided localization for LLM Unlearning framEwork (CLUE). The framework identifies the forget and retain circuits composed of important neurons, and the circuits are then transformed into conjunctive normal form. Experiments demonstrate that CLUE achieves superior forget efficacy and retain utility through precise neural localization.
arXiv Detail & Related papers (2025-09-25T10:23:16Z)
- NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models [68.09675063543402]
NeuroBreak is a top-down jailbreak analysis system designed to analyze neuron-level safety mechanisms and mitigate vulnerabilities. By incorporating layer-wise representation probing analysis, NeuroBreak offers a novel perspective on the model's decision-making process. We conduct quantitative evaluations and case studies to verify the effectiveness of our system.
arXiv Detail & Related papers (2025-09-04T08:12:06Z)
- Unraveling LLM Jailbreaks Through Safety Knowledge Neurons [26.157477756143166]
We present a novel neuron-level interpretability method that focuses on the role of safety-related knowledge neurons. We show that adjusting the activation of safety-related neurons can effectively control the model's behavior, with a mean ASR higher than 97%. We propose SafeTuning, a fine-tuning strategy that reinforces safety-critical neurons to improve model robustness.
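As a rough illustration of the activation-adjustment idea, the forward hook below rescales a chosen set of hidden units in one MLP activation; the layer path, neuron indices, and scale factor are hypothetical, and the paper's actual neuron-selection procedure is not reproduced here.

```python
import torch

def make_safety_hook(neuron_idx: torch.Tensor, scale: float):
    """Forward hook that rescales selected hidden units.
    scale > 1 strengthens the putative safety signal; 0 <= scale < 1
    weakens it (the regime where jailbreak success rises)."""
    def hook(module, inputs, output):
        output[..., neuron_idx] *= scale  # assumes the module outputs a tensor
        return output
    return hook

# Illustrative usage on a Hugging Face-style decoder (paths vary by model):
# layer = model.model.layers[12].mlp.act_fn
# handle = layer.register_forward_hook(
#     make_safety_hook(torch.tensor([17, 42, 301]), scale=2.0))
# ... model.generate(**inputs) ...
# handle.remove()
```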
arXiv Detail & Related papers (2025-09-01T17:17:06Z)
- Saffron-1: Safety Inference Scaling [69.61130284742353]
SAFFRON is a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. We publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M).
arXiv Detail & Related papers (2025-06-06T18:05:45Z)
- Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation [77.10390725623125]
Retrieval-augmented generation (RAG) is widely employed to expand the knowledge scope of LLMs. Since RAG has shown promise in knowledge-intensive tasks like open-domain question answering, its broader application to complex tasks and intelligent assistants has further advanced its utility. We present a systematic investigation of the intrinsic mechanisms by which RAG systems integrate internal (parametric) and external (retrieved) knowledge.
arXiv Detail & Related papers (2025-05-17T13:13:13Z)
- Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition. We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts. We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
- NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models [14.630626774362606]
Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. We propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints.
arXiv Detail & Related papers (2025-04-29T05:49:35Z)
- Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment approach that disentangles the DPO objective into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
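A loose sketch of combining the two components in a single training objective, written with plain cross-entropy surrogates rather than the paper's exact DPO-derived terms; the weighting `beta` and the term names are assumptions.

```python
import torch.nn.functional as F

def dual_objective_loss(refusal_logits, refusal_targets,
                        harmful_logits, harmful_targets, beta: float = 0.5):
    """(1) robust refusal: ordinary CE toward refusal continuations,
    including examples whose prefixes already contain partial unsafe text;
    (2) targeted unlearning: negated CE (gradient ascent) on
    harmful-knowledge continuations."""
    refusal_term = F.cross_entropy(refusal_logits.flatten(0, 1),
                                   refusal_targets.flatten())
    unlearn_term = -F.cross_entropy(harmful_logits.flatten(0, 1),
                                    harmful_targets.flatten())
    return refusal_term + beta * unlearn_term
```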
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
- Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis [34.62178125699054]
UNCD (UNlearning evaluation via Cognitive Diagnosis) is a novel framework for fine-grained evaluation of LLM unlearning. Our dedicated benchmark, UNCD-Cyber, provides a detailed assessment of the removal of dangerous capabilities.
arXiv Detail & Related papers (2025-02-19T06:56:59Z)
- Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states. Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
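The monitor-then-regulate pattern can be sketched as a linear probe over an intermediate hidden state that gates which output path is used; the probe, threshold, and refusal path below are illustrative stand-ins, not SafeSwitch's actual components.

```python
import torch
import torch.nn as nn

class InternalStateMonitor(nn.Module):
    """Linear probe on a chosen intermediate hidden state that predicts
    whether the upcoming output is unsafe."""
    def __init__(self, hidden_size: int, threshold: float = 0.5):
        super().__init__()
        self.probe = nn.Linear(hidden_size, 1)
        self.threshold = threshold

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, hidden_size] taken from an intermediate layer
        return torch.sigmoid(self.probe(hidden)).squeeze(-1)

def regulate(monitor, hidden, normal_logits, refusal_logits):
    """Swap in refusal logits for examples the monitor flags as unsafe."""
    unsafe = monitor(hidden) > monitor.threshold   # [batch] bool
    return torch.where(unsafe[:, None], refusal_logits, normal_logits)
```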
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
- Open Problems in Machine Unlearning for AI Safety [61.43515658834902]
Machine unlearning -- the ability to selectively forget or suppress specific types of knowledge -- has shown promise for privacy and data removal tasks. In this paper, we identify key limitations that prevent unlearning from serving as a comprehensive solution for AI safety.
arXiv Detail & Related papers (2025-01-09T03:59:10Z)
- Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level [10.476222570886483]
Large language models (LLMs) have demonstrated immense utility across various industries. As LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. This paper examines LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens.
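The decoding-level idea can be sketched as scoring the partial output with a harm estimator before committing each token; `harm_score` is a hypothetical callable (e.g., a small classifier over the decoded text), not the paper's concrete detector.

```python
import torch

@torch.no_grad()
def safe_greedy_decode(model, tokenizer, prompt: str, harm_score,
                       max_new: int = 64, tau: float = 0.8) -> str:
    """Greedy decoding that halts as soon as the partial output is judged
    harmful; harm_score(text) -> float in [0, 1] is assumed given."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new):
        logits = model(ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        candidate = torch.cat([ids, next_id], dim=-1)
        text = tokenizer.decode(candidate[0], skip_special_tokens=True)
        if harm_score(text) > tau:   # danger of the tokens so far is too high
            break                    # abort instead of emitting more harm
        ids = candidate
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```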
arXiv Detail & Related papers (2024-10-09T12:09:30Z)
- Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons [57.07507194465299]
Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. We focus on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety.
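A minimal sketch of the activation-contrasting step: pool one layer's activations over harmful versus harmless prompts and rank neurons by the gap. The pooling choice and prompt sets are assumptions, and dynamic activation patching is not shown.

```python
import torch

@torch.no_grad()
def contrast_safety_neurons(model, tokenizer, layer, harmful, harmless, top_k=50):
    """Rank hidden units of `layer` by the mean-activation difference
    between harmful and harmless prompt sets; high-gap units are candidates."""
    cache = {}
    handle = layer.register_forward_hook(
        lambda mod, inp, out: cache.update(h=out.mean(dim=1)))  # mean over tokens

    def mean_activation(prompts):
        rows = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            model(ids)                     # forward pass fills the cache
            rows.append(cache["h"].squeeze(0))
        return torch.stack(rows).mean(dim=0)

    gap = (mean_activation(harmful) - mean_activation(harmless)).abs()
    handle.remove()
    return torch.topk(gap, top_k).indices  # candidate safety neurons
```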
arXiv Detail & Related papers (2024-06-20T09:35:22Z)
- Learning to Poison Large Language Models During Instruction Tuning [12.521338629194503]
This work identifies additional security risks in Large Language Models (LLMs) by designing a new data poisoning attack tailored to exploit the instruction tuning process.
We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently.
We also propose two defense strategies against data poisoning attacks: in-context learning (ICL) and continuous learning (CL).
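A hedged sketch of gradient-guided trigger search in the HotFlip style, which GBTL's gradient guidance resembles: use the gradient of an attack loss with respect to the trigger token's embedding to rank replacement tokens. The loss function, single-token trigger, and batching are simplified assumptions.

```python
import torch

def rank_trigger_candidates(model, input_ids, trigger_pos, loss_fn, top_k=10):
    """One gradient-guided step: first-order scores for swapping the token
    at trigger_pos. The embedding matrix is read from the model itself."""
    emb_matrix = model.get_input_embeddings().weight          # [vocab, dim]
    embeds = emb_matrix[input_ids].detach().requires_grad_(True)
    loss = loss_fn(model(inputs_embeds=embeds))               # attack objective
    loss.backward()
    grad = embeds.grad[0, trigger_pos]                        # [dim]
    # First-order estimate of loss change for every vocabulary token:
    scores = emb_matrix.detach() @ grad                       # [vocab]
    return torch.topk(-scores, top_k).indices                 # tokens lowering loss
```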
arXiv Detail & Related papers (2024-02-21T01:30:03Z)
- Towards Safer Large Language Models through Machine Unlearning [19.698620794387338]
Selective Knowledge Unlearning (SKU) is designed to eliminate harmful knowledge while preserving utility on normal prompts.
SKU operates in two stages: the first aims to identify and acquire harmful knowledge within the model, whereas the second is dedicated to removing this knowledge.
Our experiments demonstrate that SKU identifies a good balance point between removing harmful information and preserving utility.
arXiv Detail & Related papers (2024-02-15T16:28:34Z)
- The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.