Multilingual Safety Alignment Via Sparse Weight Editing
- URL: http://arxiv.org/abs/2602.22554v1
- Date: Thu, 26 Feb 2026 02:46:13 GMT
- Title: Multilingual Safety Alignment Via Sparse Weight Editing
- Authors: Jiaming Liang, Zhaoxin Wang, Handing Wang
- Abstract summary: We propose a training-free alignment framework based on Sparse Weight Editing. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs. Our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities.
- Score: 11.684928396991742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) exhibit significant safety disparities across languages, with low-resource languages (LRLs) often bypassing safety guardrails established for high-resource languages (HRLs) like English. Existing solutions, such as multilingual supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and dependent on scarce multilingual safety data. In this work, we propose a novel, training-free alignment framework based on Sparse Weight Editing. Identifying that safety capabilities are localized within a sparse set of safety neurons, we formulate the cross-lingual alignment problem as a constrained linear transformation. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint. Extensive experiments across 8 languages and multiple model families (Llama-3, Qwen-2.5) demonstrate that our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities, all achieved with a single, data-efficient calculation.
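The abstract's core recipe (a closed-form edit that maps LRL harmful activations toward the HRL safety subspace, constrained to the null space of utility activations) can be sketched in NumPy. Everything below is an illustrative assumption, not the paper's exact derivation: toy random matrices stand in for real hidden states, the edit is dense rather than restricted to sparse safety neurons, and the solve-then-project order is a simplification of the optimal constrained solution.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64          # hidden dimension (toy size)
n_harm = 32     # harmful LRL prompts
n_util = 40     # utility-preserving prompts (must be < d for a non-trivial null space)

# Hypothetical activations: X_lrl are harmful-prompt activations in a low-resource
# language; T_hrl are target activations for the same prompts inside the
# high-resource language's safety subspace.
X_lrl = rng.standard_normal((d, n_harm))
T_hrl = rng.standard_normal((d, n_harm))
X_util = rng.standard_normal((d, n_util))   # general-utility activations to preserve

W = rng.standard_normal((d, d))             # weight block selected for editing

# Least-squares edit mapping LRL harmful activations onto the HRL targets:
# minimize ||(W + dW) X_lrl - T_hrl||_F  =>  dW = (T_hrl - W X_lrl) X_lrl^+
dW = (T_hrl - W @ X_lrl) @ np.linalg.pinv(X_lrl)

# Null-space projection constraint: force dW to vanish on the utility subspace,
# so dW @ X_util ≈ 0 and general capability is left untouched.
U, _, _ = np.linalg.svd(X_util, full_matrices=False)  # orthonormal basis of col(X_util)
P = np.eye(d) - U @ U.T                               # projector onto its null space
dW = dW @ P

W_edited = W + dW
# Utility activations pass through the edited weights unchanged (up to float error).
assert np.allclose(W_edited @ X_util, W @ X_util, atol=1e-8)
```

In a real model the edit would additionally be masked to the rows/columns of the identified safety neurons (the "sparse" part), and the activations would come from forward passes over harmful and benign prompt sets rather than random data.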
Related papers
- Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models [54.10540442330978]
Existing benchmarks are typically multilingual but text-only, or multimodal but monolingual. Recent multilingual red-teaming efforts render harmful prompts into images, yet rely heavily on typography-style visuals. We introduce a benchmark of 100,440 harmful image-text pairs across 10 languages, explicitly partitioned into image-dominant and text-dominant subsets.
arXiv Detail & Related papers (2026-01-30T09:18:13Z)
- Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models [50.89022445197919]
Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs). Recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety alignment.
arXiv Detail & Related papers (2025-05-26T08:25:25Z)
- MPO: Multilingual Safety Alignment via Reward Gap Optimization [88.76638442683391]
Large language models (LLMs) have become increasingly central to AI applications worldwide. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. We introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (English) to improve safety alignment across multiple languages.
arXiv Detail & Related papers (2025-05-22T16:24:51Z)
- MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety [56.77103365251923]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. We introduce a multilingual guardrail with reasoning for prompt classification.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
- The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context [0.9130277390156759]
Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. Despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings.
arXiv Detail & Related papers (2025-04-03T15:46:46Z)
- Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment [9.913748282597856]
Soteria locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language. XThreatBench is a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages.
arXiv Detail & Related papers (2025-02-16T19:44:01Z)
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
arXiv Detail & Related papers (2024-05-29T22:12:52Z)
- The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts [46.089025223336854]
This paper examines the variations in safety challenges faced by large language models across different languages.
We compare how state-of-the-art LLMs respond to the same set of malicious prompts written in higher- vs. lower-resource languages.
arXiv Detail & Related papers (2024-01-23T23:12:09Z)
- All Languages Matter: On the Multilingual Safety of Large Language Models [96.47607891042523]
We build the first multilingual safety benchmark for large language models (LLMs).
XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families.
We propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT.
arXiv Detail & Related papers (2023-10-02T05:23:34Z)