NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models
- URL: http://arxiv.org/abs/2504.21053v1
- Date: Tue, 29 Apr 2025 05:49:35 GMT
- Title: NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models
- Authors: Yi Zhou, Wenpeng Xing, Dezhang Kong, Changting Lin, Meng Han
- Abstract summary: Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. We propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints.
- Score: 14.630626774362606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. In this work, we propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints. Our method consists of three key steps: Neuron Activation Analysis, where we examine activation patterns in response to harmful and harmless prompts to detect neurons that are critical for distinguishing between harmful and harmless inputs; Similarity-Based Neuron Identification, which systematically locates the neurons responsible for safe alignment; and Neuron Relearning for Safety Removal, where we fine-tune these selected neurons to restore the model's ability to generate previously restricted responses. Experimental results demonstrate that our method effectively removes safety constraints with minimal fine-tuning, highlighting a critical vulnerability in current alignment techniques. Our findings underscore the need for robust defenses against adversarial fine-tuning attacks on LLMs.
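The first two steps of the method (activation analysis over harmful vs. harmless prompts, then locating the neurons that separate the two) can be sketched roughly as follows. This is a minimal toy illustration with synthetic activations, not the paper's actual pipeline: the array shapes, the mean-difference ranking, and `find_safety_neurons` are all illustrative assumptions, and the real attack would then fine-tune only the selected neurons.

```python
import numpy as np

def find_safety_neurons(act_harmful, act_harmless, top_k):
    """Rank neurons by how strongly their mean activation separates
    harmful from harmless prompts (the activation-analysis step)."""
    diff = act_harmful.mean(axis=0) - act_harmless.mean(axis=0)
    return np.argsort(-np.abs(diff))[:top_k]

# Toy activations: 4 prompts x 8 neurons; neuron 2 fires mainly on harmful input.
rng = np.random.default_rng(0)
act_harmless = rng.normal(0.0, 0.1, size=(4, 8))
act_harmful = rng.normal(0.0, 0.1, size=(4, 8))
act_harmful[:, 2] += 5.0

safety_neurons = find_safety_neurons(act_harmful, act_harmless, top_k=1)
print(safety_neurons)  # neuron 2 is flagged
```

In the real setting the "relearning" step would restrict gradient updates to the parameters feeding these neurons, which is why the attack needs only minimal fine-tuning.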
Related papers
- Mechanistic Interpretability of LoRA-Adapted Language Models for Nuclear Reactor Safety Applications [0.0]
This paper presents a novel methodology for interpreting how Large Language Models encode and utilize domain-specific knowledge. We adapted a general-purpose LLM to the nuclear domain using a parameter-efficient fine-tuning technique known as Low-Rank Adaptation. By comparing the neuron activation patterns of the base model to those of the fine-tuned model, we identified a sparse set of neurons whose behavior was significantly altered.
arXiv Detail & Related papers (2025-07-14T05:17:41Z) - Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - Finding Safety Neurons in Large Language Models [44.873565067389016]
Large language models (LLMs) excel in various capabilities but also pose safety risks such as generating harmful content and misinformation.
In this paper, we explore the inner mechanisms of safety alignment from the perspective of mechanistic interpretability.
We propose generation-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects.
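The two techniques named above can be illustrated with a toy forward pass: contrast activations between a "safe" and an "unsafe" run, then patch the safe activations into the unsafe run and measure the output change as a causal effect. Everything here (the one-layer ReLU model, the weights, the `patch` mechanism) is a hypothetical simplification of the paper's method.

```python
import numpy as np

def forward(x, w, patch=None):
    """One ReLU hidden layer; patch = (indices, values) overrides chosen
    neuron activations mid-forward (a toy dynamic activation patch)."""
    h = np.maximum(x @ w, 0.0)
    if patch is not None:
        idx, vals = patch
        h = h.copy()
        h[idx] = vals
    return h.sum()

w = np.array([[1.0, -1.0], [0.5, 2.0]])
x_safe, x_unsafe = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Activation contrasting: record the hidden activations of the safe run.
h_safe = np.maximum(x_safe @ w, 0.0)

# Activation patching: swap the safe activations into the unsafe run.
base = forward(x_unsafe, w)
patched = forward(x_unsafe, w, patch=(np.array([0, 1]), h_safe))
effect = patched - base  # causal effect of these neurons on the output
```

A large `effect` would indicate that the patched neurons causally mediate the behavioral difference between the two runs.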
arXiv Detail & Related papers (2024-06-20T09:35:22Z) - Adversarial Defense via Neural Oscillation inspired Gradient Masking [0.0]
Spiking neural networks (SNNs) attract great attention due to their low power consumption, low latency, and biological plausibility.
We propose a novel neural model that incorporates the bio-inspired oscillation mechanism to enhance the security of SNNs.
arXiv Detail & Related papers (2022-11-04T02:13:19Z) - Defense against Backdoor Attacks via Identifying and Purifying Bad Neurons [36.57541102989073]
We propose a novel backdoor defense method to mark and purify infected neurons in neural networks.
New metric, called benign salience, can identify infected neurons with higher accuracy than the commonly used metric in backdoor defense.
New Adaptive Regularization (AR) mechanism is proposed to assist in purifying these identified infected neurons.
arXiv Detail & Related papers (2022-08-13T01:10:20Z) - Improving Adversarial Transferability via Neuron Attribution-Based Attacks [35.02147088207232]
We propose the Neuron-based Attack (NAA), which conducts feature-level attacks with more accurate neuron importance estimations.
We derive an approximation scheme of neuron attribution to tremendously reduce the overhead.
Experiments confirm the superiority of our approach to the state-of-the-art benchmarks.
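A common way to approximate neuron attribution, in the spirit of what NAA builds on, is (activation minus baseline activation) times the gradient of the output with respect to that activation. The sketch below uses a two-layer linear toy model so the gradient is exact; the model, weights, and baseline are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

# Two-layer linear toy model: output = (x @ w1) @ w2.
w1 = np.array([[2.0, -1.0], [0.0, 3.0]])
w2 = np.array([1.0, 0.5])
x = np.array([1.0, 1.0])
baseline = np.zeros(2)

h = x @ w1                    # hidden activations for the input
h0 = baseline @ w1            # hidden activations for the baseline
grad = w2                     # d(output)/dh is just w2 for a linear readout
attribution = (h - h0) * grad  # per-neuron activation-difference x gradient
```

For a linear model this is exact: the attributions sum to the output difference between input and baseline (here 3.0), which is the completeness property attribution methods aim for.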
arXiv Detail & Related papers (2022-03-31T13:47:30Z) - DeepSensor: Deep Learning Testing Framework Based on Neuron Sensitivity [20.40306955830653]
Existing testing methods have provided fine-grained criteria based on neuron coverage and achieve a high degree of exploratory testing.
To bridge the gap, we observed that neurons which change the activation value dramatically due to minor perturbation are prone to trigger incorrect corner cases.
Motivated by this, we propose neuron sensitivity and develop a novel white-box testing framework for DNNs, denoted as DeepSensor.
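The sensitivity signal described above (neurons whose activation changes dramatically under minor input perturbation) can be estimated by Monte Carlo sampling. The sketch below is a simplified stand-in for DeepSensor's measure: the toy layer, perturbation scale, and averaging scheme are all assumptions.

```python
import numpy as np

def neuron_sensitivity(layer, x, eps=1e-2, trials=200, seed=0):
    """Mean absolute activation change under small random input
    perturbations; high values flag perturbation-sensitive neurons."""
    rng = np.random.default_rng(seed)
    base = layer(x)
    deltas = [np.abs(layer(x + rng.normal(0.0, eps, size=x.shape)) - base)
              for _ in range(trials)]
    return np.mean(deltas, axis=0)

# Toy layer: neuron 1's weight is 10x neuron 0's, so the same input
# perturbation moves its activation ten times as far.
w = np.array([[1.0, 10.0]])
sens = neuron_sensitivity(lambda x: x @ w, np.array([0.5]))
```

Test inputs that maximize the activations of high-sensitivity neurons are then more likely to trigger incorrect corner cases.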
arXiv Detail & Related papers (2022-02-12T16:44:15Z) - Few-shot Backdoor Defense Using Shapley Estimation [123.56934991060788]
We develop a new approach called Shapley Pruning to mitigate backdoor attacks on deep neural networks.
ShapPruning identifies the few infected neurons (under 1% of all neurons) and manages to protect the model's structure and accuracy.
Experiments demonstrate the effectiveness and robustness of our method against various attacks and tasks.
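The idea of scoring neurons by Shapley value can be sketched on a tiny model where the value function is exhaustively computable. The `accuracy` function below is a hypothetical stand-in for validation accuracy as a function of which neurons remain active; real ShapPruning must estimate the values, since exact computation is exponential in the number of neurons.

```python
import itertools
import numpy as np

def accuracy(active):
    """Hypothetical accuracy given which of 3 neurons stay active;
    neuron 0 plays the infected neuron that degrades clean accuracy."""
    acc = 0.9
    if 0 in active:
        acc -= 0.5
    if 1 in active:
        acc += 0.05
    return acc

def shapley_values(n):
    """Exact Shapley values via all orderings (fine only for tiny n)."""
    vals = np.zeros(n)
    perms = list(itertools.permutations(range(n)))
    for perm in perms:
        active, prev = set(), accuracy(set())
        for p in perm:
            active.add(p)
            cur = accuracy(active)
            vals[p] += cur - prev   # marginal contribution of neuron p
            prev = cur
    return vals / len(perms)

phi = shapley_values(3)
infected = int(np.argmin(phi))  # lowest Shapley value -> prune this neuron
```

Pruning the neuron with the most negative Shapley value removes the backdoor's carrier while leaving the benign neurons, which is why so few neurons (under 1%) need to be touched.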
arXiv Detail & Related papers (2021-12-30T02:27:03Z) - Fight Perturbations with Perturbations: Defending Adversarial Attacks via Neuron Influence [14.817015950058915]
We propose Neuron-level Inverse Perturbation (NIP), a novel defense against general adversarial attacks.
It calculates neuron influence from benign examples and then modifies input examples by generating inverse perturbations.
arXiv Detail & Related papers (2021-12-24T13:37:42Z) - Overcoming the Domain Gap in Contrastive Learning of Neural Action Representations [60.47807856873544]
A fundamental goal in neuroscience is to understand the relationship between neural activity and behavior.
We generated a new multimodal dataset consisting of the spontaneous behaviors generated by fruit flies.
This dataset and our new set of augmentations promise to accelerate the application of self-supervised learning methods in neuroscience.
arXiv Detail & Related papers (2021-11-29T15:27:51Z) - And/or trade-off in artificial neurons: impact on adversarial robustness [91.3755431537592]
The presence of a sufficient number of OR-like neurons in a network can lead to classification brittleness and increased vulnerability to adversarial attacks.
We define AND-like neurons and propose measures to increase their proportion in the network.
Experimental results on the MNIST dataset suggest that our approach holds promise as a direction for further exploration.
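The AND-like/OR-like distinction can be made concrete for a threshold neuron on two binary inputs: an AND-like neuron fires only when both inputs are on, while an OR-like neuron fires when either one is. The classification rule below is a crude illustrative test, not the paper's actual measure.

```python
import numpy as np

def neuron_kind(w, b):
    """Probe a threshold neuron on two binary inputs and classify it."""
    fires = [int(np.dot(x, w) + b > 0)
             for x in ([0, 1], [1, 0], [1, 1])]
    if fires == [0, 0, 1]:
        return "AND-like"
    if fires == [1, 1, 1]:
        return "OR-like"
    return "mixed"

w = np.array([1.0, 1.0])
kind_high = neuron_kind(w, b=-1.5)  # threshold too high for one input alone
kind_low = neuron_kind(w, b=-0.5)   # any single input clears the threshold
```

Under this toy rule, raising the effective threshold (more negative bias relative to the weights) pushes a neuron toward AND-like behavior, which is the direction the paper's proposed measures encourage.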
arXiv Detail & Related papers (2021-02-15T08:19:05Z) - Artificial Neural Variability for Deep Learning: On Overfitting, Noise Memorization, and Catastrophic Forgetting [135.0863818867184]
Artificial neural variability (ANV) helps artificial neural networks learn some advantages from "natural" neural networks.
ANV acts as an implicit regularizer of the mutual information between the training data and the learned model.
It can effectively relieve overfitting, label noise memorization, and catastrophic forgetting at negligible costs.
arXiv Detail & Related papers (2020-11-12T06:06:33Z) - Towards Efficient Processing and Learning with Spikes: New Approaches for Multi-Spike Learning [59.249322621035056]
We propose two new multi-spike learning rules which demonstrate better performance over other baselines on various tasks.
In the feature detection task, we re-examine the ability of unsupervised STDP and present its limitations.
Our proposed learning rules can reliably solve the task over a wide range of conditions without specific constraints being applied.
arXiv Detail & Related papers (2020-05-02T06:41:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.