Related papers: SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

URL: http://arxiv.org/abs/2506.17368v1
Date: Fri, 20 Jun 2025 15:09:10 GMT
Title: SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
Authors: Zhenglin Lai, Mengyao Liao, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li, Bingzhe Wu
Abstract summary: We formalize and systematically study MoE models' positional vulnerability. We present SAFEx, an analytical framework that robustly identifies, characterizes, and validates the safety-critical experts.
Score: 26.937824679384097
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models based on Mixture-of-Experts have achieved substantial gains in efficiency and scalability, yet their architectural uniqueness introduces underexplored safety alignment challenges. Existing safety alignment strategies, predominantly designed for dense models, are ill-suited to address MoE-specific vulnerabilities. In this work, we formalize and systematically study MoE models' positional vulnerability - the phenomenon where safety-aligned behaviors rely on specific expert modules, revealing critical risks inherent to MoE architectures. To this end, we present SAFEx, an analytical framework that robustly identifies, characterizes, and validates the safety-critical experts using a novel Stability-based Expert Selection (SES) algorithm. Notably, our approach enables the explicit decomposition of safety-critical experts into distinct functional groups, including those responsible for harmful content detection and those controlling safe response generation. Extensive experiments on mainstream MoE models, such as the recently released Qwen3-MoE, demonstrate that their intrinsic safety mechanisms rely heavily on a small subset of positional experts. Disabling these experts significantly compromises the models' ability to refuse harmful requests. For Qwen3-MoE with 6144 experts (in the FFN layer), we find that disabling as few as 12 identified safety-critical experts can cause the refusal rate to drop by 22%, demonstrating the disproportionate impact of a small set of experts on overall model safety.
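To make "disabling experts" concrete, here is a minimal sketch of expert masking in a top-k router. It is not the SAFEx implementation: the function name `route_top_k`, the logit values, and the choice to renormalize with a softmax over the selected logits are all assumptions made for illustration. The last line only works out the fraction of experts disabled in the abstract's Qwen3-MoE example (12 of 6144).

```python
import math


def route_top_k(router_logits, k, disabled=frozenset()):
    """Return (expert_index, weight) pairs for one token.

    Hypothetical helper: disabled experts are excluded before the top-k
    selection, so the router falls back to the next-best enabled experts.
    """
    candidates = [
        (logit, idx)
        for idx, logit in enumerate(router_logits)
        if idx not in disabled
    ]
    if len(candidates) < k:
        raise ValueError("fewer enabled experts than k")
    top = sorted(candidates, reverse=True)[:k]
    # Softmax over the selected logits only (one common convention, assumed here).
    peak = max(logit for logit, _ in top)
    exps = [math.exp(logit - peak) for logit, _ in top]
    total = sum(exps)
    return [(idx, e / total) for (_, idx), e in zip(top, exps)]


# Made-up router logits for a layer with 8 experts.
logits = [0.1, 2.0, -1.0, 1.5, 0.3, 0.9, -0.5, 0.0]

baseline = route_top_k(logits, k=2)               # routing with all experts enabled
ablated = route_top_k(logits, k=2, disabled={1})  # expert 1 switched off

# Fraction of experts disabled in the abstract's example: 12 of 6144.
fraction_disabled = 12 / 6144
```

With these made-up logits, the token is routed to experts 1 and 3 by default and to experts 3 and 5 once expert 1 is disabled. The 12 disabled experts in the abstract's example amount to roughly 0.2% of the 6144 experts.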

Related papers

BarrierSteer: LLM Safety via Learning Barrier Steering [83.12893815611052]
BarrierSteer is a novel framework that formalizes safety by embedding learned non-linear safety constraints directly into the model's latent representation space. We show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.
arXiv Detail & Related papers (2026-02-23T18:19:46Z)
What Matters For Safety Alignment? [38.86339753409445]
This paper presents a comprehensive empirical study on the safety alignment capabilities of AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. We identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models.
arXiv Detail & Related papers (2026-01-07T12:31:52Z)
SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.8821834954637]
We present SafeRBench, the first benchmark that assesses LRM safety end-to-end. We pioneer the incorporation of risk categories and levels into input design. We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units.
arXiv Detail & Related papers (2025-11-19T06:46:33Z)
DeepKnown-Guard: A Proprietary Model-Based Safety Response Framework for AI Agents [12.054307827384415]
The security issues associated with Large Language Models (LLMs) have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to safeguard LLMs at both the input and output levels.
arXiv Detail & Related papers (2025-11-05T03:04:35Z)
Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers [12.47462301643593]
Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve impressive performance and efficiency by dynamically routing inputs to specialized subnetworks, known as experts. We propose BadSwitch, a novel backdoor framework that integrates task-coupled dynamic trigger optimization with a sensitivity-guided Top-S expert tracing mechanism.
arXiv Detail & Related papers (2025-10-15T12:11:02Z)
SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models [66.71948519280669]
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive crossmodal reasoning but often amplify safety risks under adversarial prompts. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. We propose SaFeR-VLM, which integrates four components and supports dynamic and interpretable safety decisions beyond surface-level filtering.
arXiv Detail & Related papers (2025-10-08T10:39:12Z)
SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents [7.975014390527644]
Embodied agents powered by large language models (LLMs) inherit advanced planning capabilities; however, their direct interaction with the physical world exposes them to safety vulnerabilities. We present SafeMindBench, a multimodal benchmark with 5,558 samples spanning four task categories (Instr-Risk, Env-Risk, Order-Fix, Req-Align) across high-risk scenarios such as sabotage, harm, privacy, and illegal behavior. We introduce SafeMindAgent, a modular Planner-Executor architecture integrated with three cascaded safety modules, which incorporate safety constraints into the reasoning process.
arXiv Detail & Related papers (2025-09-30T07:24:04Z)
Steering MoE LLMs via Expert (De)Activation [118.23403783503026]
Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN). We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts.
arXiv Detail & Related papers (2025-09-11T17:55:09Z)
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions [64.85086226439954]
We present SAFE, a benchmark for assessing the safety of embodied VLM agents on hazardous instructions. SAFE comprises three components: SAFE-THOR, SAFE-VERSE, and SAFE-DIAGNOSE. We uncover systematic failures in translating hazard recognition into safe planning and execution.
arXiv Detail & Related papers (2025-06-17T16:37:35Z)
Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making [31.555271917529872]
We introduce SAFEL, a framework for systematically evaluating the physical safety of large language models (LLMs) in embodied decision making. We introduce EMBODYGUARD, a PDDL-grounded benchmark containing 942 LLM-generated scenarios covering both overtly malicious and contextually hazardous instructions. Our results highlight critical limitations in current LLMs and provide a foundation for more targeted, modular improvements in safe embodied reasoning.
arXiv Detail & Related papers (2025-05-26T13:01:14Z)
Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models [29.569220030102986]
We introduce Beyond Safe Answers (BSA) bench, a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types. Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales. Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.
arXiv Detail & Related papers (2025-05-26T08:49:19Z)
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model [30.774446187857475]
We conduct a safety evaluation of 11 Multimodal Large Reasoning Models (MLRMs) across 5 benchmarks. Our analysis reveals distinct safety patterns across different benchmarks. Leveraging the model's intrinsic reasoning capabilities to detect unsafe intent is a potential approach to addressing safety issues in MLRMs.
arXiv Detail & Related papers (2025-05-10T06:59:36Z)
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment. Certain scenarios suffer 25 times higher attack rates. Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z)
The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of OpenAI-o3 and DeepSeek-R1 reasoning models. We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z)
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose SafeBench, a comprehensive framework designed for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol. Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs [78.99703366417661]
Large language models (LLMs) increasingly assist in tasks ranging from procedural guidance to autonomous experiment orchestration. Such overreliance is particularly dangerous in high-stakes laboratory settings, where failures in hazard identification or risk assessment can result in severe accidents. We propose the Laboratory Safety Benchmark (LabSafety Bench) to evaluate models on their ability to identify potential hazards, assess risks, and predict the consequences of unsafe actions in lab environments.
arXiv Detail & Related papers (2024-10-18T05:21:05Z)
On the Role of Attention Heads in Large Language Model Safety [64.51534137177491]
Large language models (LLMs) achieve state-of-the-art performance on multiple language tasks, yet their safety guardrails can be circumvented. We propose a novel metric tailored for multi-head attention, the Safety Head ImPortant Score (Ships), to assess the individual heads' contributions to model safety.
arXiv Detail & Related papers (2024-10-17T16:08:06Z)
Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment [103.05005690990271]
Mixture of insighTful Experts (MoTE) is a novel framework that combines reasoning chains and expert mixtures to improve self-alignment. MoTE significantly improves model safety, jailbreak resistance, and over-refusal capabilities, achieving performance comparable to OpenAI's state-of-the-art o1 model.
arXiv Detail & Related papers (2024-05-01T15:06:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.