Layer-wise Swapping for Generalizable Multilingual Safety
- URL: http://arxiv.org/abs/2601.22620v1
- Date: Fri, 30 Jan 2026 06:22:02 GMT
- Title: Layer-wise Swapping for Generalizable Multilingual Safety
- Authors: Hyunseo Shin, Wonseok Hwang
- Abstract summary: Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. We propose a safety-aware layer swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training.
- Score: 8.658596218544773
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the rapid advancements of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource expert models, finetuned on their respective instruction datasets, tend to exhibit higher unsafety rates than their high-resource counterparts. In this work, we propose a safety-aware layer swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transferability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves comparable performance to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.
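The abstract describes adaptively selecting or blending modules between two experts based on their degree of specialization. The paper does not publish its exact criterion here, so the following is only a minimal sketch of the general idea: a hypothetical per-layer specialization score (parameter drift from a shared base model) decides whether a layer is transplanted from the safety expert, kept from the language expert, or linearly blended. The function names, thresholds `tau_low`/`tau_high`, and the drift heuristic are all illustrative assumptions, not the authors' method.

```python
# Sketch of safety-aware layer swapping between two finetuned experts
# that share a base model. Weights are represented as dicts of NumPy
# arrays standing in for per-layer parameter tensors.
import numpy as np

def specialization(base_layer, expert_layer):
    """Hypothetical score: relative parameter drift of an expert layer
    from the shared base model (larger = more specialized)."""
    drift = np.linalg.norm(expert_layer - base_layer)
    return drift / (np.linalg.norm(base_layer) + 1e-8)

def swap_layers(base, lang_expert, safety_expert, tau_low=0.05, tau_high=0.2):
    """Merge two experts layer by layer, without any training.

    - Weakly specialized layers: transplant the safety expert's weights.
    - Strongly language-specialized layers: keep the language expert's.
    - In between: blend linearly, weighted by the specialization score.
    """
    merged = {}
    for name in base:
        s = specialization(base[name], lang_expert[name])
        if s < tau_low:          # barely changed by language finetuning: swap
            merged[name] = safety_expert[name].copy()
        elif s > tau_high:       # heavily language-specialized: keep
            merged[name] = lang_expert[name].copy()
        else:                    # blend proportionally to specialization
            alpha = (s - tau_low) / (tau_high - tau_low)
            merged[name] = alpha * lang_expert[name] \
                + (1 - alpha) * safety_expert[name]
    return merged
```

Because the operation is a per-layer copy or interpolation over existing checkpoints, it requires no gradient updates, matching the paper's claim of transfer "without additional training."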
Related papers
- Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages [8.667909336164465]
Large language models (LLMs) are being deployed across the Global South. Everyday use involves low-resource languages, code-mixing, and culturally specific norms. Our aim is to make multilingual safety a core requirement, not an add-on, for equitable AI in underrepresented regions.
arXiv Detail & Related papers (2026-02-14T19:56:40Z) - LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models [22.273388934888278]
Our dataset comprises 45k entries in 12 languages, ranging from Hungarian to Malay. Our benchmark provides a comprehensive suite of metrics for in-depth safety evaluation.
arXiv Detail & Related papers (2025-08-18T08:59:01Z) - MPO: Multilingual Safety Alignment via Reward Gap Optimization [88.76638442683391]
Large language models (LLMs) have become increasingly central to AI applications worldwide. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. We introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (English) to improve safety alignment across multiple languages.
arXiv Detail & Related papers (2025-05-22T16:24:51Z) - MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety [56.77103365251923]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. We introduce a multilingual guardrail with reasoning for prompt classification.
arXiv Detail & Related papers (2025-04-21T17:15:06Z) - Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment [9.913748282597856]
Soteria locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language. XThreatBench is a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages.
arXiv Detail & Related papers (2025-02-16T19:44:01Z) - Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety. For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z) - ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z) - All Languages Matter: On the Multilingual Safety of Large Language Models [96.47607891042523]
We build the first multilingual safety benchmark for large language models (LLMs).
XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families.
We propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT.
arXiv Detail & Related papers (2023-10-02T05:23:34Z) - SafetyBench: Evaluating the Safety of Large Language Models [54.878612385780805]
SafetyBench is a comprehensive benchmark for evaluating the safety of Large Language Models (LLMs).
It comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns.
Our tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts.
arXiv Detail & Related papers (2023-09-13T15:56:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.