Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
- URL: http://arxiv.org/abs/2406.14144v2
- Date: Thu, 23 Oct 2025 15:10:09 GMT
- Title: Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
- Authors: Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li
- Abstract summary: Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. We focus on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety. Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about $5\%$ safety neurons, and by only patching their activations we can restore over $90\%$ of the safety performance across various red-teaming benchmarks without influencing general ability. The finding of safety neurons also helps explain the "alignment tax" phenomenon by revealing that the key neurons for model safety and helpfulness significantly overlap, yet they require different activation patterns for the same neurons. Furthermore, we demonstrate an application of our findings in safeguarding LLMs by detecting unsafe outputs before generation. The source code is available at https://github.com/THU-KEG/SafetyNeuron.
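The two techniques named in the abstract can be sketched on toy data. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: `base` and `aligned` stand in for per-neuron MLP activations recorded from an unaligned model and its safety-aligned counterpart, and the 5% activation shift is planted synthetically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for per-neuron activations recorded at inference time:
# rows = prompts, cols = neurons. "aligned" mimics a safety-aligned model,
# "base" its unaligned counterpart; all names here are illustrative.
n_prompts, n_neurons = 64, 1000
base = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))
aligned = base.copy()
safety_idx = rng.choice(n_neurons, size=50, replace=False)  # plant 5% of neurons
aligned[:, safety_idx] += 2.0  # these neurons shift under alignment

# Inference-time activation contrasting: rank neurons by the mean
# activation difference between the aligned and unaligned model.
change = np.abs(aligned.mean(axis=0) - base.mean(axis=0))
k = int(0.05 * n_neurons)
candidates = np.argsort(change)[-k:]

# Dynamic activation patching: copy the aligned model's activations for the
# candidate neurons into the base model's forward pass, keep the rest.
patched = base.copy()
patched[:, candidates] = aligned[:, candidates]

recovered = np.isin(candidates, safety_idx).mean()
print(f"planted safety neurons recovered: {recovered:.0%}")
```

In a real model the activations would be captured with forward hooks rather than simulated, and the causal effect of patching would be measured on red-teaming benchmarks rather than by recovery of planted indices.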
Related papers
- SafeNeuron: Neuron-Level Safety Alignment for Large Language Models [71.50117566279185]
We propose SafeNeuron, a neuron-level safety alignment framework that improves robustness by redistributing safety representations across the network. In experiments, SafeNeuron significantly improves robustness against neuron pruning attacks, reduces the risk of open-source models being repurposed as red-team generators, and preserves general capabilities.
arXiv Detail & Related papers (2026-02-12T16:40:05Z)
- Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons [49.772147495578736]
Cross-lingual shared safety neurons (SS-Neurons) regulate safety behavior across languages. We propose a neuron-oriented training strategy that targets SS-Neurons based on language resource distribution and model architecture.
arXiv Detail & Related papers (2026-02-01T15:28:02Z)
- Unraveling LLM Jailbreaks Through Safety Knowledge Neurons [26.157477756143166]
We present a novel neuron-level interpretability method that focuses on the role of safety-related knowledge neurons. We show that adjusting the activation of safety-related neurons can effectively control the model's behavior with a mean ASR higher than 97%. We propose SafeTuning, a fine-tuning strategy that reinforces safety-critical neurons to improve model robustness.
arXiv Detail & Related papers (2025-09-01T17:17:06Z)
- Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks [22.059668583508365]
We propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons.
arXiv Detail & Related papers (2025-08-08T03:20:25Z)
- Shape it Up! Restoring LLM Safety during Finetuning [65.75757313781104]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
- SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning [76.56522719330911]
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering. LRMs pose great safety risks against harmful queries and adversarial attacks. We propose SafeKey to better activate the safety aha moment in the key sentence.
arXiv Detail & Related papers (2025-05-22T03:46:03Z)
- Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations [1.0485739694839669]
Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, but they can also fail to do so. This suggests some degree of metacognition -- the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. We introduce a neuroscience-inspired neurofeedback paradigm designed to quantify the ability of LLMs to explicitly report and control their activation patterns.
arXiv Detail & Related papers (2025-05-19T22:32:25Z)
- NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models [14.630626774362606]
Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content.
We propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints.
arXiv Detail & Related papers (2025-04-29T05:49:35Z)
- Deciphering Functions of Neurons in Vision-Language Models [37.29432842212334]
This study aims to delve into the internals of vision-language models (VLMs) to interpret the functions of individual neurons.
We observe the activations of neurons with respect to the input visual tokens and text tokens, and reveal some interesting findings.
We build a framework that automates the explanation of neurons with the assistance of GPT-4o.
For visual neurons, we propose an activation simulator to assess the reliability of the explanations for visual neurons.
arXiv Detail & Related papers (2025-02-10T10:00:06Z)
- Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states.
Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
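A monitor of this kind can be sketched as a linear probe over internal states, which also mirrors the main paper's application of detecting unsafe outputs before generation. The snippet below is a toy version under explicit assumptions, not SafeSwitch's actual method: the "hidden states" are synthetic vectors, and the probe is a plain numpy logistic regression.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for pooled hidden states of an LLM; in practice these
# would come from a forward pass over the prompt, captured before decoding.
d = 32
w_true = rng.normal(size=d)          # hidden direction separating the classes
X = rng.normal(size=(400, d))        # 400 prompts' pooled hidden states
y = (X @ w_true > 0).astype(float)   # toy labels: 1 = unsafe, 0 = safe

# Train a logistic-regression probe with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted P(unsafe)
    w -= 0.1 * X.T @ (p - y) / len(y)       # average logistic-loss gradient

acc = (((X @ w) > 0) == (y == 1)).mean()
print(f"probe training accuracy: {acc:.0%}")
```

The point of the design is that the probe runs on states computed anyway during the prefill pass, so flagging an unsafe prompt costs almost nothing before any token is generated.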
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
- Neuron Empirical Gradient: Discovering and Quantifying Neurons Global Linear Controllability [14.693407823048478]
We show that the neuron empirical gradient (NEG) captures how changes in activations affect predictions. We also show that NEG effectively captures language skills across diverse prompts through skill neuron probing. Further analysis highlights the key properties of NEG-based skill representation: efficiency, robustness, flexibility, and interdependency.
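The core idea of an empirical gradient, estimating how a change in one neuron's activation moves the prediction, can be illustrated with finite differences. This is a toy sketch, not the paper's procedure: the "model" is a small synthetic readout head, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n_neurons = 16
readout = rng.normal(size=n_neurons)   # maps activations to a logit
act = rng.normal(size=n_neurons)       # recorded neuron activations

def logit(a: np.ndarray) -> float:
    """Toy nonlinear prediction head over neuron activations."""
    return float(np.tanh(a) @ readout)

# Empirical gradient per neuron via central differences: perturb one
# activation at a time and measure the change in the output.
eps = 1e-4
neg = np.zeros(n_neurons)
for i in range(n_neurons):
    hi, lo = act.copy(), act.copy()
    hi[i] += eps
    lo[i] -= eps
    neg[i] = (logit(hi) - logit(lo)) / (2 * eps)

# Sanity check against the analytic gradient of this toy head:
# d/da_i [tanh(a_i) * w_i] = (1 - tanh(a_i)^2) * w_i
analytic = (1 - np.tanh(act) ** 2) * readout
print(np.allclose(neg, analytic, atol=1e-6))
```

Neurons with large empirical gradients are the ones whose activations most strongly steer the prediction, which is the quantity such probing methods rank neurons by.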
arXiv Detail & Related papers (2024-12-24T00:01:24Z)
- Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text.
We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output.
Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:52Z)
- Hebbian Learning based Orthogonal Projection for Continual Learning of Spiking Neural Networks [74.3099028063756]
We develop a new method with neuronal operations based on lateral connections and Hebbian learning.
We show that Hebbian and anti-Hebbian learning on recurrent lateral connections can effectively extract the principal subspace of neural activities.
Our method consistently enables continual learning for spiking neural networks with nearly zero forgetting.
arXiv Detail & Related papers (2024-02-19T09:29:37Z)
- Neuron-Level Knowledge Attribution in Large Language Models [19.472889262384818]
We propose a static method for pinpointing significant neurons.
Compared to seven other methods, our approach demonstrates superior performance across three metrics.
We also apply our methods to analyze six types of knowledge across both attention and feed-forward network layers.
arXiv Detail & Related papers (2023-12-19T13:23:18Z)
- Causality Analysis for Evaluating the Security of Large Language Models [9.102606258312246]
Large Language Models (LLMs) are increasingly adopted in many safety-critical applications.
Recent studies have shown that LLMs are still subject to attacks such as adversarial perturbation and Trojan attacks.
We propose a framework for conducting light-weight causality-analysis of LLMs at the token, layer, and neuron level.
arXiv Detail & Related papers (2023-12-13T03:35:43Z)
- Visual Analytics of Neuron Vulnerability to Adversarial Attacks on Convolutional Neural Networks [28.081328051535618]
Adversarial attacks on a convolutional neural network (CNN) could fool a high-performance CNN into making incorrect predictions.
Our work introduces a visual analytics approach to understanding adversarial attacks.
A visual analytics system is designed to incorporate visual reasoning for interpreting adversarial attacks.
arXiv Detail & Related papers (2023-03-06T01:01:56Z)
- Adversarial Defense via Neural Oscillation inspired Gradient Masking [0.0]
Spiking neural networks (SNNs) attract great attention due to their low power consumption, low latency, and biological plausibility.
We propose a novel neural model that incorporates the bio-inspired oscillation mechanism to enhance the security of SNNs.
arXiv Detail & Related papers (2022-11-04T02:13:19Z)
- Defense against Backdoor Attacks via Identifying and Purifying Bad Neurons [36.57541102989073]
We propose a novel backdoor defense method to mark and purify infected neurons in neural networks.
A new metric, called benign salience, can identify infected neurons with higher accuracy than the metric commonly used in backdoor defense.
A new Adaptive Regularization (AR) mechanism is proposed to assist in purifying these identified infected neurons.
arXiv Detail & Related papers (2022-08-13T01:10:20Z)
- And/or trade-off in artificial neurons: impact on adversarial robustness [91.3755431537592]
The presence of a sufficient number of OR-like neurons in a network can lead to classification brittleness and increased vulnerability to adversarial attacks.
We define AND-like neurons and propose measures to increase their proportion in the network.
Experimental results on the MNIST dataset suggest that our approach holds promise as a direction for further exploration.
arXiv Detail & Related papers (2021-02-15T08:19:05Z)
- Artificial Neural Variability for Deep Learning: On Overfitting, Noise Memorization, and Catastrophic Forgetting [135.0863818867184]
Artificial neural variability (ANV) helps artificial neural networks learn some advantages from "natural" neural networks.
ANV acts as an implicit regularizer of the mutual information between the training data and the learned model.
It can effectively relieve overfitting, label noise memorization, and catastrophic forgetting at negligible costs.
arXiv Detail & Related papers (2020-11-12T06:06:33Z)
- Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.