Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment
- URL: http://arxiv.org/abs/2502.11244v1
- Date: Sun, 16 Feb 2025 19:44:01 GMT
- Title: Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment
- Authors: Somnath Banerjee, Sayan Layek, Pratyush Chatterjee, Animesh Mukherjee, Rima Hazra
- Abstract summary: Soteria locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language.
XThreatBench is a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines.
Experiments with leading open-source LLMs show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages.
- Abstract: Ensuring consistent safety across multiple languages remains a significant challenge for large language models (LLMs). We introduce Soteria, a lightweight yet powerful strategy that locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language. By altering only a fraction of parameters, Soteria drastically reduces policy violations without sacrificing overall model performance, even in low-resource settings. To rigorously evaluate our approach, we also present XThreatBench, a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs (e.g., Llama, Qwen, Mistral) show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages. These findings highlight a promising path toward scalable, linguistically attuned, and ethically aligned LLMs worldwide.
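The abstract describes locating the attention heads most implicated in harmful generation for a given language and minimally adjusting them. As a rough illustration of that general idea (not the paper's actual procedure; the attribution scores, head count, and scaling factor below are all hypothetical), one can rank heads by a per-language harm-attribution score and attenuate only the top few:

```python
import numpy as np

def steer_functional_heads(head_scores, head_outputs, top_k=2, alpha=0.0):
    """Illustrative sketch: given per-head harm-attribution scores for one
    language, down-weight the top_k most implicated heads' contributions
    before they are summed back into the residual stream.

    head_scores  : (n_heads,) attribution score per head (higher = more harmful)
    head_outputs : (n_heads, hidden) each head's output contribution
    alpha        : scale applied to the selected heads (0.0 ablates them)
    """
    # Indices of the heads most associated with harmful generations.
    ranked = np.argsort(head_scores)[::-1][:top_k]
    steered = head_outputs.copy()
    steered[ranked] *= alpha  # touch only a small fraction of parameters
    return steered, ranked

# Toy example: 4 heads, hidden size 3, heads 1 and 3 flagged as harmful.
scores = np.array([0.1, 0.9, 0.3, 0.7])
outputs = np.ones((4, 3))
steered, heads = steer_functional_heads(scores, outputs, top_k=2, alpha=0.0)
```

Only the selected heads are modified; all other head outputs pass through unchanged, which is consistent with the paper's claim of altering only a fraction of parameters while preserving overall performance.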
Related papers
- LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps [63.10843814055688]
M-ALERT is a benchmark that evaluates the safety of Large Language Models in five languages: English, French, German, Italian, and Spanish.
M-ALERT includes 15k high-quality prompts per language, totaling 75k, following the detailed ALERT taxonomy.
arXiv Detail & Related papers (2024-12-19T16:46:54Z)
- Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks [18.208272960774337]
Large Language Models (LLMs) have sparked widespread concerns about their safety.
Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning.
We take a further step to understand fine-tuning attacks in multilingual LLMs.
arXiv Detail & Related papers (2024-10-23T18:27:36Z)
- Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture [6.17896401271963]
We introduce Multilingual Blending, a mixed-language query-response scheme designed to evaluate the safety alignment of various large language models.
We investigate language patterns such as language availability, morphology, and language family that could impact the effectiveness of Multilingual Blending.
arXiv Detail & Related papers (2024-07-10T03:26:15Z)
- Safe Multi-agent Reinforcement Learning with Natural Language Constraints [49.01100552946231]
The role of natural language constraints in Safe Multi-agent Reinforcement Learning (MARL) is crucial, yet often overlooked.
We propose a novel approach named Safe Multi-agent Reinforcement Learning with Natural Language constraints (SMALL)
Our method leverages fine-tuned language models to interpret and process free-form textual constraints, converting them into semantic embeddings.
These embeddings are then integrated into the multi-agent policy learning process, enabling agents to learn policies that minimize constraint violations while optimizing rewards.
arXiv Detail & Related papers (2024-05-30T12:57:35Z)
- CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z)
- The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts [46.089025223336854]
This paper examines the variations in safety challenges faced by large language models across different languages.
We compare how state-of-the-art LLMs respond to the same set of malicious prompts written in higher- vs. lower-resource languages.
arXiv Detail & Related papers (2024-01-23T23:12:09Z)
- Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs).
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z)
- All Languages Matter: On the Multilingual Safety of Large Language Models [96.47607891042523]
We build XSafety, the first multilingual safety benchmark for large language models (LLMs).
XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families.
We propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT.
arXiv Detail & Related papers (2023-10-02T05:23:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.