Fugu-MT paper translation (abstract): Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs

Paper summary: Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs

arxiv url: http://arxiv.org/abs/2509.06346v1
Date: Mon, 08 Sep 2025 05:38:10 GMT
Status: Translation complete
Last updated in system: 2025-09-09 14:07:03.982521
Title: Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs
Title (reference translation): Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smart Routing in MoE-LLMs
Authors: Yuanteng Chen, Peisong Wang, Yuantian Shao, Jian Cheng
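Abstract summary: We introduce Ban&Pick, a post-training, plug-and-play strategy for smarter MoE routing. Ban&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For example, on Qwen3-30B-A3B it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under vLLM.
Reference score (independently computed attention score): 25.27147729066472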
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling large language models (LLMs) efficiently. Recent fine-grained MoE designs introduce hundreds of experts per layer, with multiple experts activated per token, enabling stronger specialization. However, during pre-training, routers are optimized mainly for stability and robustness: they converge prematurely and enforce balanced usage, limiting the full potential of model performance and efficiency. In this work, we uncover two overlooked issues: (i) a few highly influential experts are underutilized due to premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, we introduce Ban&Pick, a post-training, plug-and-play strategy for smarter MoE routing. Pick discovers and reinforces key experts, a small group with outsized impact on performance, leading to notable accuracy gains across domains. Ban complements this by dynamically pruning redundant experts based on layer and token sensitivity, delivering faster inference with minimal accuracy loss. Experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks demonstrate that Ban&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For instance, on Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under vLLM.
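Abstract (reference translation): Sparse Mixture-of-Experts (MoE) has become an important architecture for efficiently scaling large language models (LLMs). Recent fine-grained MoE designs place hundreds of experts in each layer and activate several experts per token, which enables stronger specialization. During pre-training, however, routers are optimized mainly for stability and robustness: they converge early and enforce balanced usage, which limits the full potential of model performance and efficiency. This work identifies two overlooked issues: (i) a few highly influential experts are underutilized because of premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, the authors introduce Ban&Pick, a post-training, plug-and-play strategy for smarter MoE routing. Pick discovers and reinforces key experts, a small group with a large impact on performance, and contributes to notable accuracy gains across domains. Ban complements this by dynamically pruning redundant experts based on layer and token sensitivity, giving faster inference with minimal accuracy loss. In experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) covering math, code, and general reasoning benchmarks, Ban&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For example, on Qwen3-30B-A3B it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under vLLM.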
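The abstract describes Pick and Ban only at a high level, so the following is a minimal sketch of the general idea rather than the paper's algorithm. It assumes that Pick can be approximated by adding a bonus to the router logits of pre-identified key experts, and that Ban can be approximated by dropping selected experts whose router weight is small relative to the top expert. The function `route_token` and the parameters `key_experts`, `key_bonus`, `ban_threshold`, and `min_experts` are hypothetical names introduced for illustration; none of them come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(logits, top_k, key_experts=(), key_bonus=0.0,
                ban_threshold=0.0, min_experts=1):
    """Return {expert_index: weight} for a single token.

    All knobs below are illustrative assumptions, not the paper's method:
      key_experts / key_bonus: "Pick"-style reinforcement of chosen experts.
      ban_threshold: "Ban"-style pruning; an expert is dropped when its
        probability divided by the top expert's probability is below it.
      min_experts: lower bound on the number of experts kept.
    """
    # Pick (assumed form): boost the router logits of the key experts.
    boosted = [x + (key_bonus if i in key_experts else 0.0)
               for i, x in enumerate(logits)]
    probs = softmax(boosted)

    # Standard top-k selection, as in an ordinary MoE router.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i],
                    reverse=True)[:top_k]

    # Ban (assumed form): prune low-weight experts from the selected set.
    top_p = probs[ranked[0]]
    kept = [i for rank, i in enumerate(ranked)
            if rank < min_experts or probs[i] / top_p >= ban_threshold]

    # Renormalize so the kept experts' weights sum to 1.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -3.0]
    print(route_token(logits, top_k=3))                     # plain top-3
    print(route_token(logits, top_k=3, ban_threshold=0.3))  # fewer experts
    print(route_token(logits, top_k=3, key_experts={3}, key_bonus=6.0))
```

In this toy setting, raising `ban_threshold` reduces the number of experts evaluated for a token, which is where any inference saving would come from, while `key_bonus` changes which experts are selected. How the paper actually identifies key experts and measures layer and token sensitivity is not stated in the abstract.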

Related papers