Related papers: Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

URL: http://arxiv.org/abs/2602.08621v1
Date: Mon, 09 Feb 2026 13:12:54 GMT
Title: Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs
Authors: Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, Yang Zhang
Abstract summary: The mixture-of-experts (MoE) architecture significantly reduces computational costs in large language models. However, prior work has largely focused on utility and efficiency, leaving the safety risks associated with this sparse architecture underexplored. We show that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes.
Score: 20.93386462211096
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: By introducing routers to selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture significantly reduces computational costs in large language models (LLMs) while maintaining competitive performance, especially for models with massive parameters. However, prior work has largely focused on utility and efficiency, leaving the safety risks associated with this sparse architecture underexplored. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the Router Safety importance score (RoSais) to quantify the safety criticality of each layer's router. Manipulation of only the high-RoSais router(s) can flip the default route into an unsafe one. For instance, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite increases attack success rate (ASR) by over 4$\times$ to 0.79, highlighting an inherent risk that router manipulation may naturally occur in MoE LLMs. We further propose a Fine-grained token-layer-wise Stochastic Optimization framework to discover more concrete Unsafe Routes (F-SOUR), which explicitly considers the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves an average ASR of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively. Finally, we outline defensive perspectives, including safety-aware route disabling and router training, as promising directions to safeguard MoE LLMs. We hope our work can inform future red-teaming and safeguarding of MoE LLMs. Our code is provided in https://github.com/TrustAIRLab/UnsafeMoE.
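The abstract's notion of a router selecting experts, and of a route changing when some experts become unavailable, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `disabled` parameter, and the choice to apply softmax after top-k selection are all assumptions made for illustration, and real MoE models differ in these details.

```python
import math
from typing import Iterable, List, Optional, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(
    router_logits: Sequence[float],
    k: int,
    disabled: Optional[Iterable[int]] = None,
) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    `disabled` is a hypothetical set of expert indices excluded from routing;
    it only illustrates how removing experts changes the selected route.
    """
    blocked = set(disabled or ())
    candidates = [i for i in range(len(router_logits)) if i not in blocked]
    if len(candidates) < k:
        raise ValueError("not enough enabled experts for top-k routing")
    chosen = sorted(candidates, key=lambda i: router_logits[i], reverse=True)[:k]
    # Assumption: weights are renormalised over the chosen experts only.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


# Toy router scores for one token over four experts (made-up numbers).
logits = [2.0, 0.5, 1.5, -1.0]
default_route = top_k_route(logits, k=2)
altered_route = top_k_route(logits, k=2, disabled={0})
print([i for i, _ in default_route], [i for i, _ in altered_route])
```

In this toy setting the same token is sent to a different pair of experts once expert 0 is excluded, which is the kind of route change the abstract refers to, although the paper's own masking and optimization procedures may work differently.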

Related papers

RerouteGuard: Understanding and Mitigating Adversarial Risks for LLM Routing [20.559596977062146]
LLM routers are vulnerable to adversarial attacks in the form of LLM rerouting. We introduce RerouteGuard, a flexible and scalable guardrail framework for LLM rerouting. RerouteGuard achieves over 99% detection accuracy against state-of-the-art rerouting attacks.
arXiv Detail & Related papers (2026-01-29T08:17:08Z)
SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models [66.71948519280669]
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive crossmodal reasoning but often amplify safety risks under adversarial prompts. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. We propose SaFeR-VLM, which integrates four components and supports dynamic and interpretable safety decisions beyond surface-level filtering.
arXiv Detail & Related papers (2025-10-08T10:39:12Z)
UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models [67.91151588917396]
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. We propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
arXiv Detail & Related papers (2025-10-02T16:43:33Z)
Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment [15.402485173557352]
We propose SafeMoE, a safe fine-tuning method tailored to large language models (LLMs). SafeMoE directly mitigates routing drift by penalizing the gap between the routing weights of a fine-tuned model and those of the initial safety-aligned model. Experiments show that SafeMoE effectively mitigates harmful fine-tuning (HFT) attacks, reducing the harmfulness score of OLMoE from 62.0 to 5.0.
arXiv Detail & Related papers (2025-09-26T04:10:32Z)
How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities [62.474732677086855]
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance. We propose the DSC benchmark: Diverse, Simple, and Categorized, an evaluation framework that categorizes router performance across a broad spectrum of query types.
arXiv Detail & Related papers (2025-03-20T19:52:30Z)
Life-Cycle Routing Vulnerabilities of LLM Router [14.967638451190403]
Large language models (LLMs) have achieved remarkable success in natural language processing, yet their performance and computational costs vary significantly. LLM routers play a crucial role in dynamically balancing these trade-offs. We present a comprehensive investigation into the life-cycle routing vulnerabilities of LLM routers.
arXiv Detail & Related papers (2025-03-09T06:00:35Z)
SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals [51.49737867797442]
Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. We show that LLMs can similarly perform internal assessments about safety in their internal states. We propose SafeSwitch, a framework that regulates unsafe outputs by utilizing the prober-based internal state monitor.
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [50.980446687774645]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability [25.750371424096436]
Large Language Models (LLMs) are increasingly deployed in various applications. Our research finds that existing defense strategies lead LLMs to predominantly adopt a rejection-oriented stance. We introduce the MoGU framework, designed to enhance LLMs' safety while preserving their usability.
arXiv Detail & Related papers (2024-05-23T12:19:59Z)
ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding [89.0074567748505]
We present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to boost the safety of existing instruction-tuned LLMs without any additional training. Experiments on 6 safety and 2 general-purpose tasks show that ROSE not only brings consistent and significant safety improvements (up to +13.8% safety score) upon 5 types of instruction-tuned LLMs, but also benefits the general-purpose ability of LLMs.
arXiv Detail & Related papers (2024-02-19T06:58:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.