NeST: Neuron Selective Tuning for LLM Safety
- URL: http://arxiv.org/abs/2602.16835v1
- Date: Wed, 18 Feb 2026 20:01:01 GMT
- Title: NeST: Neuron Selective Tuning for LLM Safety
- Authors: Sasha Behrouzi, Lichao Wu, Mohamadreza Rostami, Ahmad-Reza Sadeghi,
- Abstract summary: Safety alignment is essential for the responsible deployment of large language models (LLMs). We present NeST, a lightweight, structure-aware safety alignment framework that strengthens refusal behavior by selectively adapting a small subset of safety-relevant neurons. We benchmark NeST against three dominant baselines: full fine-tuning, LoRA-based fine-tuning, and circuit breakers across 10 open-weight LLMs.
- Score: 12.78786094112
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safety alignment is essential for the responsible deployment of large language models (LLMs). Yet, existing approaches often rely on heavyweight fine-tuning that is costly to update, audit, and maintain across model families. Full fine-tuning incurs substantial computational and storage overhead, while parameter-efficient methods such as LoRA trade efficiency for inconsistent safety gains and sensitivity to design choices. Safety intervention mechanisms such as circuit breakers reduce unsafe outputs without modifying model weights, but do not directly shape or preserve the internal representations that govern safety behavior. These limitations hinder rapid and reliable safety updates, particularly in settings where models evolve frequently or must adapt to new policies and domains. We present NeST, a lightweight, structure-aware safety alignment framework that strengthens refusal behavior by selectively adapting a small subset of safety-relevant neurons while freezing the remainder of the model. NeST aligns parameter updates with the internal organization of safety behavior by clustering functionally coherent safety neurons and enforcing shared updates within each cluster, enabling targeted and stable safety adaptation without broad model modification or inference-time overhead. We benchmark NeST against three dominant baselines: full fine-tuning, LoRA-based fine-tuning, and circuit breakers across 10 open-weight LLMs spanning multiple model families and sizes. Across all evaluated models, NeST reduces the attack success rate from an average of 44.5% to 4.36%, corresponding to a 90.2% reduction in unsafe generations, while requiring only 0.44 million trainable parameters on average. This amounts to a 17,310x decrease in updated parameters compared to full fine-tuning and a 9.25x reduction relative to LoRA, while consistently achieving stronger safety performance for alignment.
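To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of neuron-selective tuning in the spirit of the abstract: it scores hidden neurons for safety relevance (placeholder random scores here), groups the selected neurons into clusters, freezes all other parameters, and uses a gradient hook so each cluster receives one shared update. The toy MLP, the scoring, and the clustering are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer MLP block (dims are arbitrary).
mlp = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
up_proj = mlp[0]

# 1) Score hidden neurons for safety relevance.
#    Placeholder: random scores; NeST derives relevance from safety behavior.
scores = torch.rand(up_proj.out_features)
safety_neurons = torch.topk(scores, k=16).indices

# 2) Group the selected neurons into clusters.
#    Placeholder: two arbitrary groups; NeST clusters functionally coherent neurons.
clusters = [safety_neurons[:8], safety_neurons[8:]]

# 3) Freeze everything except the selected rows, and share one averaged
#    gradient within each cluster so the cluster moves as a unit.
for p in mlp.parameters():
    p.requires_grad_(False)
up_proj.weight.requires_grad_(True)

def cluster_shared_grad(grad):
    masked = torch.zeros_like(grad)
    for idx in clusters:
        masked[idx] = grad[idx].mean(dim=0, keepdim=True)  # shared cluster update
    return masked

up_proj.weight.register_hook(cluster_shared_grad)

# One refusal-style training step on dummy data.
opt = torch.optim.SGD([up_proj.weight], lr=1e-2)
x, y = torch.randn(4, 64), torch.randn(4, 64)
loss = nn.functional.mse_loss(mlp(x), y)
loss.backward()
opt.step()
print(f"updated {len(safety_neurons)} of {up_proj.out_features} hidden neurons")
```

The point of the sketch is only the update pattern: all weights stay frozen except rows tied to the selected neurons, and clustered rows receive an averaged gradient, which is what keeps the trainable parameter count in the sub-million range described in the abstract.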
Related papers
- SafeNeuron: Neuron-Level Safety Alignment for Large Language Models [71.50117566279185]
We propose SafeNeuron, a neuron-level safety alignment framework that improves robustness by redistributing safety representations across the network. In experiments, SafeNeuron significantly improves robustness against neuron pruning attacks, reduces the risk of open-source models being repurposed as red-team generators, and preserves general capabilities.
arXiv Detail & Related papers (2026-02-12T16:40:05Z)
- Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment [55.14890249389052]
Existing defenses either embed safety recovery into fine-tuning or rely on fine-tuning-derived priors for post-hoc correction. We propose Q-realign, a post-hoc defense method based on post-training quantization. Our work provides a practical, turnkey solution for safety-aware deployment.
arXiv Detail & Related papers (2026-01-13T00:07:24Z)
- GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs [24.327693899810615]
We present GateBreaker, the first training-free, lightweight, and architecture-agnostic attack framework. GateBreaker compromises the safety alignment of modern MoE LLMs at inference time. Our study shows that MoE safety concentrates within a small subset of neurons coordinated by sparse routing.
arXiv Detail & Related papers (2025-12-24T07:13:24Z)
- A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space [91.99501941169831]
GuardSpace is a guardrail framework for preserving safety alignment throughout fine-tuning. For Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT.
arXiv Detail & Related papers (2025-10-16T04:57:53Z)
- UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models [67.91151588917396]
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. We propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
arXiv Detail & Related papers (2025-10-02T16:43:33Z)
- Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance (a hedged sketch of such a parameter-space EMA appears after this list). Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
- Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks [22.059668583508365]
We propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons.
arXiv Detail & Related papers (2025-08-08T03:20:25Z)
- LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning [61.594212398272184]
Low-Rank Extrapolation (LoX) improves robustness against benign and malicious fine-tuning attacks. LoX leads to 11% to 54% absolute reductions in attack success rates.
arXiv Detail & Related papers (2025-06-18T16:30:02Z)
- AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin [38.577959886489076]
Large language models (LLMs) are vulnerable to safety risks during fine-tuning. We propose a methodology for safety fine-tuning called AsFT (Anchoring Safety in Fine-Tuning).
arXiv Detail & Related papers (2025-06-10T05:59:48Z)
- Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment [4.181987990532721]
Existing paradigms for ensuring AI safety, such as guardrail models and alignment training, often compromise either inference efficiency or development flexibility. We introduce Disentangled Safety Adapters (DSA), a novel framework addressing these challenges by decoupling safety-specific computations from a task-optimized base model. DSA utilizes lightweight adapters that leverage the base model's internal representations, enabling diverse and flexible safety functionalities with minimal impact on inference cost.
arXiv Detail & Related papers (2025-05-30T19:11:52Z)
- NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning [37.024666077902225]
A handful of malicious data uploaded by users can subtly manipulate the fine-tuning process, resulting in an alignment-broken model. Existing methods to counteract fine-tuning attacks typically require substantial computational resources. We propose Neuron-Level Safety Realignment (NLSR).
arXiv Detail & Related papers (2024-12-17T02:59:04Z)
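As a reading aid for the "Rethinking Safety in LLM Fine-tuning" entry above, here is a minimal, hypothetical sketch of keeping a parameter-space EMA alongside ordinary fine-tuning. The toy linear model, decay value, and loss are illustrative assumptions, not that paper's actual setup; the sketch only shows the general pattern of maintaining a smoothed copy of the weights that can be evaluated or deployed for safety.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(16, 16)            # stand-in for an LLM being fine-tuned
ema_model = copy.deepcopy(model)     # smoothed copy kept in parameter space
for p in ema_model.parameters():
    p.requires_grad_(False)

decay = 0.999                        # illustrative EMA decay
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):              # dummy fine-tuning loop
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()    # placeholder task loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update: ema <- decay * ema + (1 - decay) * current
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1 - decay)

# ema_model holds the smoothed weights one would evaluate for safety.
```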