Related papers: BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts

BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts

URL: http://arxiv.org/abs/2504.18598v2
Date: Tue, 29 Apr 2025 02:23:57 GMT
Title: BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts
Authors: Qingyue Wang, Qi Pang, Xixun Lin, Shuai Wang, Daoyuan Wu,
Abstract summary: Mixture-of-Experts (MoE) have emerged as a powerful architecture for large language models (LLMs)<n>This paper presents the first backdoor attack against MoE-based LLMs where the attackers poison dormant experts''<n>We also show that dormant experts can serve as dominating experts to manipulate model predictions.
Score: 12.755458703336153
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mixture-of-Experts (MoE) have emerged as a powerful architecture for large language models (LLMs), enabling efficient scaling of model capacity while maintaining manageable computational costs. The key advantage lies in their ability to route different tokens to different ``expert'' networks within the model, enabling specialization and efficient handling of diverse input. However, the vulnerabilities of MoE-based LLMs still have barely been studied, and the potential for backdoor attacks in this context remains largely unexplored. This paper presents the first backdoor attack against MoE-based LLMs where the attackers poison ``dormant experts'' (i.e., underutilized experts) and activate them by optimizing routing triggers, thereby gaining control over the model's output. We first rigorously prove the existence of a few ``dominating experts'' in MoE models, whose outputs can determine the overall MoE's output. We also show that dormant experts can serve as dominating experts to manipulate model predictions. Accordingly, our attack, namely BadMoE, exploits the unique architecture of MoE models by 1) identifying dormant experts unrelated to the target task, 2) constructing a routing-aware loss to optimize the activation triggers of these experts, and 3) promoting dormant experts to dominating roles via poisoned training data. Extensive experiments show that BadMoE successfully enforces malicious prediction on attackers' target tasks while preserving overall model utility, making it a more potent and stealthy attack than existing methods.

Related papers

Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models [74.1970982768771]
We show that well-established data-poisoning pipelines can successfully implant backdoors into MDLMs.<n>We introduce a backdoor defense framework for MDLMs named DiSP (Diffusion Self-Purification)
arXiv Detail & Related papers (2026-02-24T15:47:52Z)
Exploiting the Experts: Unauthorized Compression in MoE-LLMs [1.580774794371876]
We study the prunability of MoE-LLMs under task-specific usage.<n>We propose defense strategies that aim to make MoE models harder to compress and fine-tune without authorization.
arXiv Detail & Related papers (2025-11-22T20:08:29Z)
Backdoor Unlearning by Linear Task Decomposition [69.91984435094157]
Foundation models are highly susceptible to adversarial perturbations and targeted backdoor attacks.<n>Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior.<n>This raises the question of whether backdoors can be removed without compromising the general capabilities of the models.
arXiv Detail & Related papers (2025-10-16T16:18:07Z)
Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers [12.47462301643593]
Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve impressive performance and efficiency by dynamically routing inputs to specializedworks, known as experts.<n>We propose BadSwitch, a novel backdoor framework that integrates task-coupled dynamic trigger optimization with a sensitivity-guided Top-S expert tracing mechanism.
arXiv Detail & Related papers (2025-10-15T12:11:02Z)
Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution [49.78359632298156]
Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks.<n> backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated.<n>We present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution.
arXiv Detail & Related papers (2025-08-28T17:05:18Z)
MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models [56.09354775405601]
Model extraction attacks aim to replicate the functionality of a black-box model through query access.<n>Most existing defenses presume that attacker queries have out-of-distribution (OOD) samples, enabling them to detect and disrupt suspicious inputs.<n>We propose MISLEADER, a novel defense strategy that does not rely on OOD assumptions.
arXiv Detail & Related papers (2025-06-03T01:37:09Z)
Autonomy-of-Experts Models [34.82103329222486]
We propose a novel MoE paradigm in which experts autonomously select themselves to process inputs.<n>AoE is based on the insight that an expert is aware of its own capacity to effectively process a token.<n>Only the top-ranking experts proceed with the forward pass, while the others abort.
arXiv Detail & Related papers (2025-01-22T18:37:08Z)
Towards Adversarial Robustness of Model-Level Mixture-of-Experts Architectures for Semantic Segmentation [11.311414617703308]
We evaluate the adversarial vulnerability of MoEs for semantic segmentation of urban and highway traffic scenes. We show that MoEs are, in most cases, more robust to per-instance and universal white-box adversarial attacks and can better withstand transfer attacks.
arXiv Detail & Related papers (2024-12-16T09:49:59Z)
UOE: Unlearning One Expert Is Enough For Mixture-of-experts LLMS [35.237427998489785]
We propose a novel single-expert unlearning framework, referred to as UOE, for Mixture-of-Experts (MoE) LLMs.<n>Through expert attribution, unlearning is concentrated on the most actively engaged expert for the specified knowledge.<n>We show that UOE enhances both forget quality up to 5% and model utility by 35% on MoE LLMs across various benchmarks.
arXiv Detail & Related papers (2024-11-27T22:46:08Z)
MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network(FFN) and zero-computation experts. MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z)
MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities. Their powerful generative abilities enable flexible responses based on various queries or instructions. This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z)
Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift [104.76588209308666]
This paper explores backdoor attacks in LVLM instruction tuning across mismatched training and testing domains.<n>We introduce a new evaluation dimension, backdoor domain generalization, to assess attack robustness.<n>We propose a multimodal attribution backdoor attack (MABA) that injects domain-agnostic triggers into critical areas.
arXiv Detail & Related papers (2024-06-27T02:31:03Z)
A Closer Look into Mixture-of-Experts in Large Language Models [26.503570706063634]
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance. MoE architecture could increase the model size without sacrificing computational efficiency. We make an initial attempt to understand the inner workings of MoE-based large language models.
arXiv Detail & Related papers (2024-06-26T10:07:57Z)
A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in finetuned MoE models. We theoretically prove that prioritizing the pruning of the experts with a smaller change of the routers l2 norm from the pretrained model guarantees the preservation of test accuracy. Although our theoretical analysis is centered on binary classification tasks on simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models. The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss. Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
Effective Backdoor Mitigation in Vision-Language Models Depends on the Pre-training Objective [71.39995120597999]
Modern machine learning models are vulnerable to adversarial and backdoor attacks. Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for training multimodal models. CleanCLIP is the current state-of-the-art approach to mitigate the effects of backdooring in multimodal models.
arXiv Detail & Related papers (2023-11-25T06:55:13Z)
Mitigating Backdoors in Federated Learning with FLD [7.908496863030483]
Federated learning allows clients to collaboratively train a global model without uploading raw data for privacy preservation. This feature has recently been found responsible for federated learning's vulnerability in the face of backdoor attacks. We propose Federated Layer Detection (FLD), a novel model filtering approach for effectively defending against backdoor attacks.
arXiv Detail & Related papers (2023-03-01T07:54:54Z)
MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparsely Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated. However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.