Fugu-MT Paper Translation (Summary): Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs

Paper summary: Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs

arxiv url: http://arxiv.org/abs/2509.10377v1
Date: Fri, 12 Sep 2025 16:09:39 GMT
Status: Translation complete
Last updated in system: 2025-09-15 16:03:08.154668
Title: Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs
Title (reference translation): Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs
Authors: Yixiao Zhou, Ziyu Zhao, Dongzhou Cheng, Zhiliang Wu, Jie Gui, Yi Yang, Fei Wu, Yu Cheng, Hehe Fan
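Abstract summary: DERN is a task-agnostic, retraining-free framework for expert pruning and reconstruction. On commonsense reasoning and MMLU benchmarks, it improves performance by more than 5% under 50% expert sparsity.
Reference score (independently computed attention score): 54.95810313530111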
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) due to their computational efficiency. However, though only a few experts are activated for each token, SMoE still requires loading all expert parameters, leading to high memory usage and challenges in deployment. Previous work has tried to reduce the overhead by pruning and merging experts, but primarily focused on expert-level operations, leaving neuron-level structure underexplored. We propose DERN (Dropping Experts, Recombining Neurons), a task-agnostic and retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and contain semantic conflicts at the neuron level, which poses challenges for direct merging. To solve this, DERN works in three steps: it first prunes redundant experts using router statistics; then it decomposes them into neuron-level expert segments, assigning each segment to its most compatible retained expert; and finally, it merges segments within each retained expert to build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, without extra training. It also greatly reduces the number of experts and memory usage, making SMoE LLMs easier to deploy in practice.
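Abstract (reference translation): Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) because of their computational efficiency. Although only a few experts are activated per token, an SMoE model must still load every expert's parameters, which leads to high memory usage and makes deployment difficult. Earlier work tried to reduce this overhead by pruning and merging experts, but it concentrated on expert-level operations and left neuron-level structure underexplored. We propose DERN (Dropping Experts, Recombining Neurons), a task-agnostic, retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and carry semantic conflicts at the neuron level, which makes direct merging difficult. To address this, DERN works in three steps: it first prunes redundant experts using router statistics; it then decomposes the pruned experts into neuron-level expert segments and assigns each segment to the most compatible retained expert; finally, it merges the segments inside each retained expert to build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, with no additional training. It also substantially reduces the number of experts and the memory usage, making SMoE LLMs easier to deploy in practice.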
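The three steps named in the abstract (prune by router statistics, assign neuron-level segments, merge within each retained expert) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the compatibility score or the merge rule, so cosine similarity and plain averaging are assumptions here, as are the function names `cosine` and `prune_and_recombine`, the representation of an expert as a list of neuron weight vectors, and the use of raw router selection counts as the pruning statistic.

```python
import math


def cosine(u, v):
    """Cosine similarity between two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_and_recombine(experts, router_counts, keep):
    """Toy version of the three steps; every scoring rule here is an assumption.

    experts:       list of experts, each a list of neuron weight vectors
    router_counts: how often the router selected each expert (hypothetical statistic)
    keep:          number of experts to retain
    """
    # Step 1: drop the experts the router selected least often.
    ranked = sorted(range(len(experts)), key=lambda i: router_counts[i], reverse=True)
    retained = sorted(ranked[:keep])
    dropped = sorted(ranked[keep:])

    # Step 2: treat each neuron of a dropped expert as a segment and assign it
    # to the retained neuron it is most similar to (assumed: cosine similarity).
    assigned = {i: {} for i in retained}
    for d in dropped:
        for segment in experts[d]:
            _, i, j = max(
                (cosine(segment, neuron), i, j)
                for i in retained
                for j, neuron in enumerate(experts[i])
            )
            assigned[i].setdefault(j, []).append(segment)

    # Step 3: merge each retained neuron with its assigned segments
    # (assumed: element-wise mean), so the expert keeps its original size.
    merged = {}
    for i in retained:
        merged[i] = []
        for j, neuron in enumerate(experts[i]):
            group = [neuron] + assigned[i].get(j, [])
            merged[i].append([sum(col) / len(group) for col in zip(*group)])
    return merged


# Toy data: three experts with two 2-dimensional neurons each.
experts = [
    [[1.0, 0.0], [0.0, 1.0]],    # expert 0, used often
    [[3.0, 0.0], [0.0, 3.0]],    # expert 1, rarely used, will be dropped
    [[1.0, 1.0], [-1.0, 1.0]],   # expert 2
]
merged = prune_and_recombine(experts, router_counts=[10, 1, 7], keep=2)
print(merged)
```

In this toy run, expert 1 is dropped and both of its neurons are folded into expert 0, whose neurons point in the same directions; expert 2 is left unchanged. Matching at the neuron level rather than averaging whole experts is the point the abstract makes about misaligned experts.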

Related papers