Fugu-MT Paper Translation (Abstract): Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Paper overview: Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

arxiv url: http://arxiv.org/abs/2602.08621v1
Date: Mon, 09 Feb 2026 13:12:54 GMT
Status: Translation complete
Last updated in system: 2026-02-10 20:26:25.242606
Title: Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs
Title (reference translation): Sparse Models and Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs
Authors: Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, Yang Zhang
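Abstract summary: The mixture-of-experts (MoE) architecture significantly reduces the computational cost of large language models. However, prior work has focused mainly on utility and efficiency, and the safety risks associated with this sparse architecture remain underexplored. By discovering unsafe routes, the authors show that the safety of MoE LLMs is as sparse as their architecture.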
Reference score (attention measure computed independently by this site): 20.93386462211096
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: By introducing routers to selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture significantly reduces computational costs in large language models (LLMs) while maintaining competitive performance, especially for models with massive parameters. However, prior work has largely focused on utility and efficiency, leaving the safety risks associated with this sparse architecture underexplored. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the Router Safety importance score (RoSais) to quantify the safety criticality of each layer's router. Manipulation of only the high-RoSais router(s) can flip the default route into an unsafe one. For instance, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite increases attack success rate (ASR) by over 4$\times$ to 0.79, highlighting an inherent risk that router manipulation may naturally occur in MoE LLMs. We further propose a Fine-grained token-layer-wise Stochastic Optimization framework to discover more concrete Unsafe Routes (F-SOUR), which explicitly considers the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves an average ASR of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively. Finally, we outline defensive perspectives, including safety-aware route disabling and router training, as promising directions to safeguard MoE LLMs. We hope our work can inform future red-teaming and safeguarding of MoE LLMs. Our code is provided in https://github.com/TrustAIRLab/UnsafeMoE.
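Abstract (reference translation): By introducing routers that selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture greatly reduces the computational cost of large language models (LLMs) while keeping performance competitive, especially for models with very large parameter counts. However, prior work has focused mainly on utility and efficiency, and the safety risks associated with this sparse architecture remain underexplored. This work shows that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes: routing configurations that, once activated, turn safe outputs into harmful ones. Specifically, the authors first introduce the Router Safety importance score (RoSais) to quantify the safety criticality of each layer's router. Manipulating only the high-RoSais routers can switch the default route to an unsafe one. For example, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite raises the attack success rate (ASR) by more than 4$\times$, to 0.79, which highlights an inherent risk that router manipulation may occur naturally in MoE LLMs. The authors further propose a Fine-grained token-layer-wise Stochastic Optimization framework for discovering more concrete Unsafe Routes (F-SOUR), which explicitly accounts for the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves an average ASR of 0.90 on JailbreakBench and 0.98 on AdvBench. Finally, the authors outline defensive perspectives, including safety-aware route disabling and router training, as promising directions for safeguarding MoE LLMs. They hope the work can inform future red-teaming and safeguarding of MoE LLMs. The code is available at https://github.com/TrustAIRLab/UnsafeMoE.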
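
As background for the abstract's notion of a "route", the following is a minimal sketch of generic top-k expert routing for a single MoE layer. It is not the paper's RoSais score or F-SOUR procedure, and the abstract does not specify how routers are masked. The function name `top_k_route`, the `blocked` parameter, and the example logits are all hypothetical, chosen only to show that changing what a router may select changes which experts process a token.

```python
import math


def top_k_route(router_logits, k, blocked=()):
    """Pick the k highest-scoring experts per token and normalize their weights.

    router_logits: one list of per-expert scores for each token.
    blocked: hypothetical set of expert indices the router may not select;
             an illustrative intervention, not the paper's masking method.
    Returns, per token, a list of (expert_index, weight) pairs.
    """
    routes = []
    for token_logits in router_logits:
        # Keep only experts that are still selectable.
        candidates = [
            (logit, idx)
            for idx, logit in enumerate(token_logits)
            if idx not in blocked
        ]
        if k > len(candidates):
            raise ValueError("k exceeds the number of selectable experts")
        # Highest-scoring experts first, then keep the top k.
        chosen = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
        # Softmax over the selected logits only (one common convention;
        # real models differ in where normalization happens).
        peak = chosen[0][0]
        exps = [math.exp(logit - peak) for logit, _ in chosen]
        total = sum(exps)
        routes.append([(idx, e / total) for (_, idx), e in zip(chosen, exps)])
    return routes


# Two tokens, four experts (made-up scores).
example_logits = [[2.0, 1.0, 0.5, -1.0], [0.1, 0.3, 2.5, 2.4]]
default_route = top_k_route(example_logits, k=2)
altered_route = top_k_route(example_logits, k=2, blocked={0})
```

In this toy setup, blocking expert 0 moves the first token from experts 0 and 1 to experts 1 and 2, while the second token's route is unchanged. A route in the abstract's sense spans such selections across layers and tokens.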

Related papers