Related papers: SafeNeuron: Neuron-Level Safety Alignment for Large Language Models

SafeNeuron: Neuron-Level Safety Alignment for Large Language Models

URL: http://arxiv.org/abs/2602.12158v1
Date: Thu, 12 Feb 2026 16:40:05 GMT
Title: SafeNeuron: Neuron-Level Safety Alignment for Large Language Models
Authors: Zhaoxin Wang, Jiaming Liang, Fengbin Zhu, Weixiang Zhao, Junfeng Fang, Jiayi Ji, Handing Wang, Tat-Seng Chua,
Abstract summary: We propose SafeNeuron, a neuron-level safety alignment framework that improves robustness by redistributing safety representations across the network.<n>In experiments, SafeNeuron significantly improves robustness against neuron pruning attacks, reduces the risk of open-source models being repurposed as red-team generators, and preserves general capabilities.
Score: 71.50117566279185
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) and multimodal LLMs are typically safety-aligned before release to prevent harmful content generation. However, recent studies show that safety behaviors are concentrated in a small subset of parameters, making alignment brittle and easily bypassed through neuron-level attacks. Moreover, most existing alignment methods operate at the behavioral level, offering limited control over the model's internal safety mechanisms. In this work, we propose SafeNeuron, a neuron-level safety alignment framework that improves robustness by redistributing safety representations across the network. SafeNeuron first identifies safety-related neurons, then freezes these neurons during preference optimization to prevent reliance on sparse safety pathways and force the model to construct redundant safety representations. Extensive experiments across models and modalities demonstrate that SafeNeuron significantly improves robustness against neuron pruning attacks, reduces the risk of open-source models being repurposed as red-team generators, and preserves general capabilities. Furthermore, our layer-wise analysis reveals that safety behaviors are governed by stable and shared internal representations. Overall, SafeNeuron provides an interpretable and robust perspective for model alignment.

Related papers

Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons [49.772147495578736]
Cross-lingual shared safety neurons (SS-Neurons) regulate safety behavior across languages.<n>We propose a neuron-oriented training strategy that targets SS-Neurons based on language resource distribution and model architecture.
arXiv Detail & Related papers (2026-02-01T15:28:02Z)
NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs [19.133502330591092]
We propose NeuronTune, a fine-grained framework that dynamically modulates sparse neurons to achieve simultaneous safety-utility optimization.<n>Our approach first identifies safety-critical and utility-preserving neurons across all layers via attribution, then employs meta-learning to adaptively amplify safety-neuron activations and suppress utility-neuron activations.
arXiv Detail & Related papers (2025-08-13T04:05:28Z)
Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks [22.059668583508365]
We propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce the fine-tuning safety risks.<n>FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons.
arXiv Detail & Related papers (2025-08-08T03:20:25Z)
Shape it Up! Restoring LLM Safety during Finetuning [65.75757313781104]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning [76.56522719330911]
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering.<n>LRMs pose great safety risks against harmful queries and adversarial attacks.<n>We propose SafeKey to better activate the safety aha moment in the key sentence.
arXiv Detail & Related papers (2025-05-22T03:46:03Z)
NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models [14.630626774362606]
Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content.<n>We propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints.
arXiv Detail & Related papers (2025-04-29T05:49:35Z)
SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals [51.49737867797442]
Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content.<n>We show that LLMs can similarly perform internal assessments about safety in their internal states.<n>We propose SafeSwitch, a framework that regulates unsafe outputs by utilizing the prober-based internal state monitor.
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning [37.024666077902225]
A handful of malicious data uploaded by users can subtly manipulate the finetuning process, resulting in an alignment-broken model.<n>Existing methods to counteract fine-tuning attacks typically require substantial computational resources.<n>We propose textbfNeuron-textbfLevel textbfSafety textbfRealignment.
arXiv Detail & Related papers (2024-12-17T02:59:04Z)
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons [57.07507194465299]
Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment.<n>We focus on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors.<n>We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety.
arXiv Detail & Related papers (2024-06-20T09:35:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.