Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons
- URL: http://arxiv.org/abs/2602.01283v1
- Date: Sun, 01 Feb 2026 15:28:02 GMT
- Title: Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons
- Authors: Xianhui Zhang, Chengyu Xie, Linxia Zhu, Yonghui Yang, Weixiang Zhao, Zifeng Cheng, Cong Wang, Fei Shen, Tat-Seng Chua
- Abstract summary: Cross-lingual shared safety neurons (SS-Neurons) regulate safety behavior across languages. We propose a neuron-oriented training strategy that targets SS-Neurons based on language resource distribution and model architecture.
- Score: 49.772147495578736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual safety remains significantly imbalanced, leaving non-high-resource (NHR) languages vulnerable compared to robust high-resource (HR) ones. Moreover, the neural mechanisms driving safety alignment remain unclear despite observed cross-lingual representation transfer. In this paper, we find that LLMs contain a set of cross-lingual shared safety neurons (SS-Neurons), a remarkably small yet critical neuronal subset that jointly regulates safety behavior across languages. We first identify monolingual safety neurons (MS-Neurons) and validate their causal role in safety refusal behavior through targeted activation and suppression. Our cross-lingual analyses then identify SS-Neurons as the subset of MS-Neurons shared between HR and NHR languages, serving as a bridge to transfer safety capabilities from HR to NHR domains. We observe that suppressing these neurons causes concurrent safety drops across NHR languages, whereas reinforcing them improves cross-lingual defensive consistency. Building on these insights, we propose a simple neuron-oriented training strategy that targets SS-Neurons based on language resource distribution and model architecture. Experiments demonstrate that fine-tuning this tiny neuronal subset outperforms state-of-the-art methods, significantly enhancing NHR safety while maintaining the model's general capabilities. The code and dataset will be available at https://github.com/1518630367/SS-Neuron-Expansion.
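To make the pipeline concrete, below is a minimal, hypothetical PyTorch sketch on a toy MLP: it locates monolingual safety neurons by contrasting mean activations on harmful versus benign inputs, intersects the sets across an HR and an NHR language to get SS-Neurons, and masks gradients so fine-tuning updates only that subset. The toy model, random data, `top_k`, and the activation-gap criterion are illustrative assumptions, not the paper's exact procedure.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one MLP block of an LLM; hidden units play the role of neurons.
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
for p in (model[0].bias, model[2].bias):
    p.requires_grad_(False)  # restrict the sketch to weight updates

def mean_activation(inputs: torch.Tensor) -> torch.Tensor:
    """Mean post-ReLU activation of each hidden neuron over a batch."""
    with torch.no_grad():
        return torch.relu(model[0](inputs)).mean(dim=0)

def safety_neurons(harmful: torch.Tensor, benign: torch.Tensor, top_k: int = 64) -> set:
    """Monolingual safety neurons (MS-Neurons): largest harmful-vs-benign activation gap."""
    gap = (mean_activation(harmful) - mean_activation(benign)).abs()
    return set(gap.topk(top_k).indices.tolist())

# Hypothetical prompt representations for a high-resource (HR) and a
# non-high-resource (NHR) language.
harmful_hr, benign_hr = torch.randn(32, 64), torch.randn(32, 64)
harmful_nhr, benign_nhr = torch.randn(32, 64), torch.randn(32, 64)

# SS-Neurons: the cross-lingual intersection of the monolingual sets.
ss = safety_neurons(harmful_hr, benign_hr) & safety_neurons(harmful_nhr, benign_nhr)
idx = torch.tensor(sorted(ss), dtype=torch.long)

# Neuron-oriented fine-tuning: zero the gradient everywhere except the weights
# feeding into and out of the SS-Neurons, so only that tiny subset is trained.
mask_in = torch.zeros_like(model[0].weight)
mask_in[idx] = 1.0
mask_out = torch.zeros_like(model[2].weight)
mask_out[:, idx] = 1.0
model[0].weight.register_hook(lambda g: g * mask_in)
model[2].weight.register_hook(lambda g: g * mask_out)
```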
Related papers
- Multilingual Safety Alignment Via Sparse Weight Editing [11.684928396991742]
We propose a training-free alignment framework based on Sparse Weight Editing. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs. Our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities.
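The summary does not reproduce the closed-form solution itself; one plausible reading is a ridge-regularized least-squares map from LRL hidden states onto HRL targets. A hypothetical sketch, where the names `H_lrl`, `H_hrl`, `lam` and the objective are assumptions for illustration:
```python
import torch

torch.manual_seed(0)
d, n = 64, 128  # hidden size, number of harmful-prompt representations

# Hypothetical hidden states: harmful prompts in a low-resource language (LRL)
# and target "safe" representations from a high-resource language (HRL).
H_lrl = torch.randn(n, d)
H_hrl = torch.randn(n, d)

# Closed-form ridge-regularized least squares:
#   W* = argmin_W ||H_lrl @ W - H_hrl||_F^2 + lam * ||W||_F^2
#      = (H_lrl^T H_lrl + lam I)^{-1} H_lrl^T H_hrl
lam = 1e-2
W = torch.linalg.solve(H_lrl.T @ H_lrl + lam * torch.eye(d), H_lrl.T @ H_hrl)

# W maps an LRL representation toward the HRL safety subspace.
print(torch.dist(H_lrl @ W, H_hrl))  # residual of the fit
```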
arXiv Detail & Related papers (2026-02-26T02:46:13Z)
- Robust Spiking Neural Networks Against Adversarial Attacks [49.08210314590693]
Spiking Neural Networks (SNNs) represent a promising paradigm for energy-efficient neuromorphic computing. In this study, we theoretically demonstrate that threshold-neighboring spiking neurons are the key factors limiting the robustness of directly trained SNNs. We find that these neurons set the upper limits for the maximum potential strength of adversarial attacks and are prone to state-flipping under minor disturbances.
arXiv Detail & Related papers (2026-02-24T05:06:12Z)
- SafeNeuron: Neuron-Level Safety Alignment for Large Language Models [71.50117566279185]
We propose SafeNeuron, a neuron-level safety alignment framework that improves robustness by redistributing safety representations across the network. In experiments, SafeNeuron significantly improves robustness against neuron pruning attacks, reduces the risk of open-source models being repurposed as red-team generators, and preserves general capabilities.
arXiv Detail & Related papers (2026-02-12T16:40:05Z)
- Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models [54.10540442330978]
Existing benchmarks are typically multilingual but text-only, or multimodal but monolingual. Recent multilingual red-teaming efforts render harmful prompts into images, yet rely heavily on typography-style visuals. We introduce a benchmark of 100,440 harmful image-text pairs across 10 languages, explicitly partitioned into image-dominant and text-dominant subsets.
arXiv Detail & Related papers (2026-01-30T09:18:13Z)
- Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks [22.059668583508365]
We propose the Fine-Grained Safety Neurons (FGSN) method with training-free continual projection to reduce the safety risks of fine-tuning. FGSN integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons.
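The projection step is not spelled out in this summary; a common training-free variant removes from each fine-tuning update its component inside a safety subspace, leaving the localized safety neurons untouched. A hypothetical sketch along those lines (`safety_dirs` and the layer shapes are assumptions):
```python
import torch

torch.manual_seed(0)
d = 256  # hidden width of one layer

# Hypothetical basis for the safety subspace of this layer, e.g. directions
# owned by previously localized fine-grained safety neurons.
safety_dirs = torch.randn(8, d)
Q, _ = torch.linalg.qr(safety_dirs.T)  # orthonormal basis, shape (d, 8)

def project_out(update: torch.Tensor) -> torch.Tensor:
    """Remove the component of a fine-tuning update lying in the safety subspace."""
    return update - (update @ Q) @ Q.T

delta = torch.randn(32, d)       # a raw fine-tuning weight update
safe_delta = project_out(delta)  # continual-projection step
print((safe_delta @ Q).norm())   # ~0: no overlap with the safety subspace
```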
arXiv Detail & Related papers (2025-08-08T03:20:25Z)
- NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models [14.630626774362606]
Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. We propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints.
arXiv Detail & Related papers (2025-04-29T05:49:35Z)
- Language-specific Neurons Do Not Facilitate Cross-Lingual Transfer [21.205821852762362]
Existing techniques for identifying language-specific neurons could, in principle, be leveraged to enhance cross-lingual task performance of low-resource languages. We find, however, that such neuron-specific interventions are insufficient to yield cross-lingual improvements on downstream tasks.
arXiv Detail & Related papers (2025-03-21T18:08:11Z)
- Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons [57.07507194465299]
Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. We focus on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety.
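Both probes can be illustrated on a toy layer: activation contrasting ranks neurons by the gap between mean activations on harmful and harmless inputs, and activation patching overwrites those neurons during a forward pass to test their causal effect. A hypothetical sketch, where the toy layer, random data, and `topk(8)` are assumptions:
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(64, 128)  # stand-in for one transformer sub-layer

def acts(x: torch.Tensor) -> torch.Tensor:
    """Per-neuron activations for a batch of (toy) prompt representations."""
    with torch.no_grad():
        return layer(x)

# Activation contrasting: rank neurons by the harmful-vs-harmless activation gap.
harmful, harmless = torch.randn(16, 64), torch.randn(16, 64)
contrast = (acts(harmful).mean(0) - acts(harmless).mean(0)).abs()
safety_ids = contrast.topk(8).indices

# Dynamic activation patching: during inference, overwrite the located neurons
# with their mean "harmless" activation and measure the change in behavior.
safe_mean = acts(harmless).mean(0)

def patch(module, inputs, output):
    output = output.clone()
    output[..., safety_ids] = safe_mean[safety_ids]
    return output

handle = layer.register_forward_hook(patch)
patched_out = layer(torch.randn(4, 64))  # hook applies the patch here
handle.remove()
```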
arXiv Detail & Related papers (2024-06-20T09:35:22Z)
- Sharing Matters: Analysing Neurons Across Languages and Tasks in LLMs [85.0284555835015]
Large language models (LLMs) have revolutionized the field of natural language processing (NLP), yet few studies have explored their internal workings in multilingual settings. We classify neurons into four distinct categories based on their responses to a specific input across different languages.
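The four-way split can be made concrete by counting, per neuron, how many languages it activates for and then bucketing. A hypothetical sketch (the activation flags and exact category boundaries are assumptions; the paper's criteria may differ):
```python
import torch

torch.manual_seed(0)
num_neurons, num_langs = 1000, 6

# Hypothetical flags: active[i, j] means neuron i fires for language j.
active = torch.rand(num_neurons, num_langs) > 0.5
count = active.sum(dim=1)

categories = torch.full((num_neurons,), 3)  # 3: inactive for this input
categories[count == 1] = 2                  # 2: language-specific
categories[count > 1] = 1                   # 1: shared by some languages
categories[count == num_langs] = 0          # 0: shared by all languages
print(torch.bincount(categories, minlength=4))
```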
arXiv Detail & Related papers (2024-06-13T16:04:11Z)
- Backdoor Attack on Multilingual Machine Translation [53.28390057407576]
Multilingual machine translation (MNMT) systems have security vulnerabilities: an attacker can inject poisoned data into a low-resource language pair to cause malicious translations in other languages. This type of attack is of particular concern given the larger attack surface inherent to low-resource settings.
arXiv Detail & Related papers (2024-04-03T01:32:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.