Steering MoE LLMs via Expert (De)Activation
- URL: http://arxiv.org/abs/2509.09660v1
- Date: Thu, 11 Sep 2025 17:55:09 GMT
- Title: Steering MoE LLMs via Expert (De)Activation
- Authors: Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, Nanyun Peng
- Abstract summary: Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN). We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts.
- Score: 118.23403783503026
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors. By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. In adversarial attack mode, it drops safety by -41% alone, and -100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.
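The abstract's recipe lends itself to a short sketch: compare per-expert routing frequencies across paired inputs with contrasting behaviors, then mask the router logits of the most behavior-linked experts at inference. The scoring rule, data shapes, and masking details below are illustrative assumptions, not the authors' exact procedure.
```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k = 64, 8

def routing_frequency(traces):
    """Fraction of routing slots assigned to each expert over a set of inputs."""
    counts = np.zeros(n_experts)
    total = 0
    for ids in traces:                 # ids: (n_tokens, top_k) selected expert indices
        np.add.at(counts, ids.ravel(), 1)
        total += ids.size
    return counts / max(total, 1)

# Stand-in routing traces for paired prompts exhibiting contrasting behaviors.
trace_a = [rng.integers(0, n_experts, size=(20, top_k)) for _ in range(16)]
trace_b = [rng.integers(0, n_experts, size=(20, top_k)) for _ in range(16)]

# Score experts by how much more often they fire under behavior B than A.
score = routing_frequency(trace_b) - routing_frequency(trace_a)
behavior_experts = np.argsort(score)[-4:]       # most behavior-linked experts
print("candidate behavior-linked experts:", behavior_experts)

# Steering at inference: mask these experts' router logits so they are never
# selected (deactivation); forcing their selection would activate them instead.
router_logits = rng.normal(size=n_experts)
router_logits[behavior_experts] = -np.inf       # deactivate
chosen = np.argsort(router_logits)[-top_k:]
assert not set(behavior_experts.tolist()) & set(chosen.tolist())
```
Because only router decisions are touched, this kind of steering needs no retraining or weight edits, which is what lets it double as an attack surface in the adversarial setting the abstract describes.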
Related papers
- Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures [57.98221536489363]
Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses intellectual property protection and diversity risks. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits". Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process.
arXiv Detail & Related papers (2025-10-19T19:15:08Z)
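A hedged sketch of the fingerprinting idea: histogram expert usage over a shared probe corpus and compare models by fingerprint similarity. The metric and the synthetic routing data below are illustrative assumptions, not the paper's actual signature.
```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, n_tokens, top_k = 64, 4096, 8

def expert_fingerprint(expert_ids):
    """Normalized expert-usage histogram over a probe corpus."""
    hist = np.bincount(expert_ids.ravel(), minlength=n_experts).astype(float)
    return hist / hist.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in routing decisions: a teacher with skewed expert usage, a distilled
# student that largely copies it, and an independently trained model.
teacher_pref = rng.dirichlet(0.3 * np.ones(n_experts))
indep_pref = rng.dirichlet(0.3 * np.ones(n_experts))
teacher = rng.choice(n_experts, size=(n_tokens, top_k), p=teacher_pref)
independent = rng.choice(n_experts, size=(n_tokens, top_k), p=indep_pref)
student = teacher.copy()
noise = rng.random(student.shape) < 0.1         # student deviates on 10% of slots
student[noise] = rng.integers(0, n_experts, size=noise.sum())

f_teacher = expert_fingerprint(teacher)
print("student vs. teacher:    ", cosine(expert_fingerprint(student), f_teacher))
print("independent vs. teacher:", cosine(expert_fingerprint(independent), f_teacher))
# A distilled student inherits the teacher's routing habits, so its
# fingerprint similarity is expected to be markedly higher.
```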
- Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers [12.47462301643593]
Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve impressive performance and efficiency by dynamically routing inputs to specialized subnetworks, known as experts. We propose BadSwitch, a novel backdoor framework that integrates task-coupled dynamic trigger optimization with a sensitivity-guided Top-S expert tracing mechanism.
arXiv Detail & Related papers (2025-10-15T12:11:02Z)
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- The Rogue Scalpel: Activation Steering Compromises LLM Safety [11.402179030703188]
Activation steering is a technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. We demonstrate the opposite: steering systematically breaks model alignment safeguards, making models comply with harmful requests.
arXiv Detail & Related papers (2025-09-26T08:49:47Z)
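The mechanism in question is easy to picture; a minimal sketch, assuming a toy torch model in place of a real LLM and a random vector in place of a learned semantic direction:
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
# Toy stand-in for an LLM; in practice the hook targets a transformer block.
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                      nn.Linear(d_model, d_model))

steering_vector = torch.randn(d_model)   # e.g. a "compliance" direction
alpha = 4.0                              # steering strength

def steer(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return output + alpha * steering_vector

handle = model[0].register_forward_hook(steer)
x = torch.randn(2, d_model)
steered = model(x)
handle.remove()                          # detach the hook to restore behavior
unsteered = model(x)
print("max |steered - unsteered|:", (steered - unsteered).abs().max().item())
```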
- Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training [1.5349686675266894]
Current methods for content safety in Large Language Models (LLMs) rely on multi-stage training pipelines. We propose a unified co-training framework that efficiently integrates multiple safety behaviors. We show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance.
arXiv Detail & Related papers (2025-08-12T02:39:33Z)
- SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification [26.937824679384097]
We formalize and systematically study the positional vulnerability of MoE models. We present SAFEx, an analytical framework that robustly identifies, characterizes, and validates the safety-critical experts.
arXiv Detail & Related papers (2025-06-20T15:09:10Z)
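One way to read "stable" identification, sketched under assumptions (the statistic and thresholds here are ours, not SAFEx's): rank experts by activation on harmful prompts and keep only those that stay top-ranked across resampled subsets.
```python
import numpy as np

rng = np.random.default_rng(2)
n_experts, n_prompts = 64, 200

# Stand-in per-prompt expert activation counts on harmful inputs; experts
# 0..3 are made genuinely safety-linked so the demo has signal to find.
acts = rng.poisson(1.0, size=(n_prompts, n_experts)).astype(float)
acts[:, :4] += rng.poisson(3.0, size=(n_prompts, 4))

def top_experts(sample, k=8):
    return set(np.argsort(sample.sum(axis=0))[-k:].tolist())

# Stability selection: keep experts that rank in the top set across most
# random half-subsets of the harmful prompts.
n_rounds, hits = 100, np.zeros(n_experts)
for _ in range(n_rounds):
    subset = acts[rng.choice(n_prompts, size=n_prompts // 2, replace=False)]
    for e in top_experts(subset):
        hits[e] += 1

stable = np.where(hits / n_rounds >= 0.9)[0]
print("stable safety-critical experts:", stable)
```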
- Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs). Our framework finetunes only adapter modules and text-embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our method reduces their attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z)
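The tuning recipe (adapters plus text embeddings only) reduces to selective parameter freezing; a minimal sketch with hypothetical module names, not the paper's actual architecture:
```python
import torch.nn as nn

def freeze_except_adapters(model: nn.Module) -> None:
    """Leave only adapter and embedding parameters trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = ("adapter" in name) or ("embed" in name)

# Toy stand-in; real LVLMs name their modules differently.
model = nn.ModuleDict({
    "backbone": nn.Linear(32, 32),
    "adapter_down": nn.Linear(32, 4),
    "adapter_up": nn.Linear(4, 32),
    "embed_tokens": nn.Embedding(100, 32),
})
freeze_except_adapters(model)
print("trainable:", [n for n, p in model.named_parameters() if p.requires_grad])
# -> only adapter_down.*, adapter_up.*, and embed_tokens.weight remain trainable
```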
- Mixture of Tunable Experts - Behavior Modification of DeepSeek-R1 at Inference Time [1.1655046053160683]
We present a method that extends the Mixture-of-Experts architecture of Large Language Models (LLMs). MoTE enables meaningful and focused behavior changes in LLMs on the fly at inference time.
arXiv Detail & Related papers (2025-02-16T12:24:39Z)
- Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control [44.326363467045496]
Large Language Models (LLMs) have become a critical area of research in Reinforcement Learning from Human Feedback (RLHF). Representation engineering offers a new, training-free approach: it leverages semantic features to control the representation of an LLM's intermediate hidden states. However, it is difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature.
arXiv Detail & Related papers (2024-11-04T08:36:03Z)
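The core move of representation engineering can be sketched as a difference-of-means direction over contrastive hidden states; the synthetic data below only illustrates that idea, not the paper's sparse-control method.
```python
import numpy as np

rng = np.random.default_rng(3)
d_model = 64

# Stand-in hidden states at one layer for contrastive prompt sets; the
# "positive" set is shifted along a hidden ground-truth direction.
truth_dir = rng.normal(size=d_model)
truth_dir /= np.linalg.norm(truth_dir)
h_pos = rng.normal(size=(128, d_model)) + 2.0 * truth_dir   # e.g. honest
h_neg = rng.normal(size=(128, d_model))                     # e.g. dishonest

# Difference-of-means gives one semantic feature direction.
direction = h_pos.mean(axis=0) - h_neg.mean(axis=0)
direction /= np.linalg.norm(direction)
print("cosine(recovered, true):", float(direction @ truth_dir))

# The entry's caveat: a single direction like this must compress several
# behaviors (honesty, safety, ...), motivating sparse, per-behavior control.
```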
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, lightweight moderation tool for LLM safety. WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rates.
arXiv Detail & Related papers (2024-06-26T16:58:20Z)
- BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z)
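A hedged sketch of that bi-level loop, with a toy linear model and made-up objectives standing in for the real safety losses:
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16
model = nn.Linear(d, 1)                    # toy stand-in for the LLM
x = torch.randn(64, d)                     # stand-in for token embeddings
opt_model = torch.optim.Adam(model.parameters(), lr=1e-2)

for outer_step in range(20):
    # Inner loop: find a universal perturbation that elicits the unwanted
    # behavior (here: drives the toy model's score positive).
    delta = torch.zeros(d, requires_grad=True)
    opt_delta = torch.optim.Adam([delta], lr=1e-1)
    for _ in range(10):
        unsafe_score = model(x + delta).mean()
        opt_delta.zero_grad()
        (-unsafe_score).backward()         # ascend on the unsafe objective
        opt_delta.step()

    # Outer loop: update weights so the model stays safe under that delta.
    safe_loss = model(x + delta.detach()).clamp(min=0).mean()
    opt_model.zero_grad()
    safe_loss.backward()
    opt_model.step()

print("unsafe score under final perturbation:",
      model(x + delta.detach()).mean().item())
```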
- Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment [103.05005690990271]
Mixture of insighTful Experts (MoTE) is a novel framework that combines reasoning chains and expert mixtures to improve self-alignment. MoTE significantly improves model safety, jailbreak resistance, and over-refusal capabilities, achieving performance comparable to OpenAI's state-of-the-art o1 model.
arXiv Detail & Related papers (2024-05-01T15:06:05Z)