Fugu-MT Paper Translation (Abstract): MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts

Paper overview: MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts

arxiv url: http://arxiv.org/abs/2511.21089v1
Date: Wed, 26 Nov 2025 06:14:26 GMT
Status: Translation complete
Last updated in system: 2025-11-27 18:37:58.986171
Title: MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts
Authors: Ivan Novikov
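Abstract summary: Large language models (LLMs) are predominantly deployed as dense transformers, in which every parameter in every feed-forward block is active for every token. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE show that much of the useful computation lives in sparse, semi-modular substructures inside dense feed-forward networks. This paper introduces MLPMoE (MLP Mixture-of-Experts), a training-free transformation that restructures the dense MLP in transformer blocks into a static, high-cardinality mixture of experts.
Reference score (independently computed attention measure): 0.0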
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are predominantly deployed as dense transformers, where every parameter in every feed-forward block is activated for every token. While architecturally simple, this is computationally inefficient, since inference costs scale linearly with parameter count. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE reveal that much of the useful computation lives in sparse, semi-modular substructures inside dense feed-forward networks, but these approaches typically rely on clustering, activation profiling, singular value decomposition, or custom routing that requires calibration data. This paper introduces MLPMoE (MLP Mixture-of-Experts), a training-free, deterministic transformation that restructures the dense MLP in transformer blocks into a static, high-cardinality mixture of experts. The transformation uses simple tensor slicing and summation, reinterpreting the algebra of tensor parallelism as a topological conversion rather than a distributed training pattern. We further introduce Fractal Fade (differential branch sparsity) and Compensated Pruning (variance-preserving branch reduction) as lightweight mechanisms for structured sparsity. On Qwen2.5-0.5B-Instruct and DeepSeek-R1-Distill-Llama-8B, the zero-shot MLPMoE transform changes a proxy perplexity metric by less than 0.05 percent while keeping the parameter count effectively constant. On the 8B model, differential sparsity removes about 20 percent of MLP parameters while keeping perplexity within about 2 percent of the dense baseline. The method operates entirely post hoc on existing checkpoints and does not require gradients, calibration sets, or router training. Code is available at https://gist.github.com/iwallarm/fc2ef1eddf226ca7814f9e5e2ae9bad1
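The abstract's "simple tensor slicing and summation" can be illustrated with the standard tensor-parallel identity: if the hidden dimension of an MLP is cut into contiguous slices, the sum of the per-slice outputs equals the dense output. The following is a minimal sketch of that identity only, not the author's code. It assumes a gated SiLU MLP of the kind used in Qwen2.5 and Llama-style models, contiguous equal-size slices, and no router; all function and variable names are hypothetical. Fractal Fade and Compensated Pruning are not covered.

```python
import math
import random

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dense_mlp(x, W_gate, W_up, W_down):
    """Gated MLP (assumed form): W_down @ (silu(W_gate @ x) * (W_up @ x))."""
    g = matvec(W_gate, x)
    u = matvec(W_up, x)
    h = [silu(gi) * ui for gi, ui in zip(g, u)]
    return matvec(W_down, h)

def split_into_experts(W_gate, W_up, W_down, num_experts):
    """Slice the hidden dimension into contiguous, equal-size experts."""
    hidden = len(W_gate)
    assert hidden % num_experts == 0, "hidden size must divide evenly"
    size = hidden // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * size, (e + 1) * size
        experts.append((
            W_gate[lo:hi],                    # rows of the gate projection
            W_up[lo:hi],                      # rows of the up projection
            [row[lo:hi] for row in W_down],   # columns of the down projection
        ))
    return experts

def static_moe(x, experts):
    """Every expert is always active; outputs are summed (no router)."""
    out = None
    for W_g, W_u, W_d in experts:
        y = dense_mlp(x, W_g, W_u, W_d)
        out = y if out is None else [a + b for a, b in zip(out, y)]
    return out

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, hidden = 4, 8  # toy sizes chosen for illustration
W_gate = rand_matrix(hidden, d_model)
W_up = rand_matrix(hidden, d_model)
W_down = rand_matrix(d_model, hidden)
x = [random.uniform(-1, 1) for _ in range(d_model)]

dense_out = dense_mlp(x, W_gate, W_up, W_down)
moe_out = static_moe(x, split_into_experts(W_gate, W_up, W_down, num_experts=4))

# The two outputs agree up to floating-point summation order.
max_diff = max(abs(a - b) for a, b in zip(dense_out, moe_out))
print(max_diff)
```

Because the activation is applied elementwise over the hidden dimension, each slice can be evaluated independently, which is why this toy decomposition needs no gradients or calibration data and leaves the parameter count unchanged.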

Related papers