Fugu-MT Paper Translation (Summary): Multilingual Safety Alignment via Self-Distillation

Paper summary: Multilingual Safety Alignment via Self-Distillation

arxiv url: http://arxiv.org/abs/2605.02971v2
Date: Thu, 07 May 2026 19:25:15 GMT
Status: Translation complete
Last updated in system: 2026-05-11 16:31:22.732455
Title: Multilingual Safety Alignment via Self-Distillation
Authors: Ruiyang Qin, Qingzhuo Wang, Dongrui Liu, Qiang Li, Zhihua Wei, Wen Shen,
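Abstract summary: Large language models (LLMs) exhibit severe multilingual safety misalignment. We propose a cross-lingual safeguard transfer framework named Multilingual Self-Distillation (MSD). Our framework is flexible and can be integrated with different self-distillation strategies.
Reference score (attention score computed by this site): 17.94152626632751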
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) exhibit severe multilingual safety misalignment: they possess strong safeguards in high-resource languages but remain highly vulnerable to jailbreak attacks in low-resource languages. Current safety alignment methods generally rely on high-quality response data for each target language, which is expensive and difficult to generate. In this paper, we propose a cross-lingual safeguard transfer framework named Multilingual Self-Distillation (MSD). This framework transfers an LLM's inherent safety capabilities from high-resource (e.g., English) to low-resource (e.g., Javanese) languages, overcoming the need for response data in any language. Our framework is flexible and can be integrated with different self-distillation strategies. Specifically, we implement two concrete methods -- on-policy MSD and off-policy MSD -- both of which enable effective cross-lingual safety transfer using only multilingual queries. Furthermore, we propose Dual-Perspective Safety Weighting (DPSW), a divergence measure to optimize the distillation objective. By jointly considering the perspectives of both the teacher and the student, DPSW adaptively increases the penalty weights on safety-critical tokens while reducing the weights on non-critical tokens. Extensive experiments on representative LLMs across diverse multilingual jailbreak and utility benchmarks demonstrate that our method consistently achieves superior multilingual safety performance. Notably, it generalizes effectively to more challenging datasets and unseen languages while preserving the model's general capabilities.
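The abstract does not give the DPSW formula, so the following is only a minimal sketch of the general idea of a token-weighted distillation loss, not the authors' method. It assumes the teacher distribution comes from the model reading a high-resource query and the student distribution from the same model reading the low-resource version of that query. The weighting rule (each position weighted by its share of the summed forward and reverse KL divergence) is a hypothetical stand-in chosen for illustration, and all function names are invented here.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def position_divergences(teacher_logits, student_logits):
    """Return (forward_kl, reverse_kl) for every token position."""
    pairs = []
    for t_log, s_log in zip(teacher_logits, student_logits):
        p = softmax(t_log)  # teacher view (assumed: high-resource query)
        q = softmax(s_log)  # student view (assumed: low-resource query)
        pairs.append((kl_divergence(p, q), kl_divergence(q, p)))
    return pairs


def token_weights(teacher_logits, student_logits):
    """Hypothetical weighting: positions where the two views disagree
    more (forward + reverse KL) receive a larger share of the weight."""
    pairs = position_divergences(teacher_logits, student_logits)
    scores = [fwd + rev for fwd, rev in pairs]
    total = sum(scores)
    if total == 0:
        # No disagreement anywhere: fall back to uniform weights.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def weighted_distillation_loss(teacher_logits, student_logits):
    """Weighted sum of per-position forward KL terms."""
    pairs = position_divergences(teacher_logits, student_logits)
    weights = token_weights(teacher_logits, student_logits)
    return sum(w * fwd for w, (fwd, _) in zip(weights, pairs))


# Toy example: two token positions over a three-word vocabulary.
teacher = [[5.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
student = [[0.0, 5.0, 0.0], [1.0, 1.0, 1.0]]
weights = token_weights(teacher, student)
loss = weighted_distillation_loss(teacher, student)
```

In this toy input the two views disagree only at the first position, so that position carries all of the weight and the second contributes nothing to the loss. A real implementation would operate on model logits in a tensor library and would use the weighting defined in the paper.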


Related papers