Fugu-MT paper translation (summary): MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

Paper summary: MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

arxiv url: http://arxiv.org/abs/2604.06798v1
Date: Wed, 08 Apr 2026 08:12:26 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.418555
Title: MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization
Title (reference translation): MoBiE: Efficient inference of a mixture of binary experts under post-training quantization
Authors: Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Dawei Yang,
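Abstract summary: MoBiE is a binarization framework designed for Mixture-of-Experts (MoE) based large language models (LLMs). MoBiE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks.
Reference score (independently computed attention score): 11.19613037505662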
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mixture-of-Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE-based LLMs. MoBiE is built on three core innovations: 1. using joint SVD decomposition to reduce cross-expert redundancy; 2. integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; 3. introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that MoBiE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks. For example, on Qwen3-30B-A3B, MoBiE reduces perplexity by 52.2%, improves average zero-shot performance by 43.4%, achieves over 2× inference speedup, and further shortens quantization time. The code is available at https://github.com/Kishon-zzx/MoBiE.
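Abstract (reference translation): Mixture-of-Experts (MoE) based large language models (LLMs) deliver strong performance but suffer from high memory and computation costs. Weight binarization is extremely efficient, but existing binary methods designed for dense LLMs struggle with MoE-specific problems such as cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. This work therefore proposes MoBiE, the first binarization framework tailored to MoE-based LLMs. MoBiE is built on three core innovations: 1. using joint SVD decomposition to reduce cross-expert redundancy; 2. integrating global loss gradients into local Hessian metrics to improve weight importance estimation; 3. introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations without incurring additional storage overhead, balancing efficiency and model performance. Extensive experiments show that MoBiE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks. For example, on Qwen3-30B-A3B, MoBiE reduces perplexity by 52.2%, improves average zero-shot performance by 43.4%, achieves over 2× inference speedup, and further shortens quantization time. The code is available at https://github.com/Kishon-zzx/MoBiE.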
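
Note on "weight binarization": the abstract does not spell out the quantizer itself, so the following is only a minimal sketch of the generic sign-and-scale scheme known from earlier binary-network work, not MoBiE's algorithm. Each weight row is approximated by one floating-point scale times a vector of +1/-1 signs; for fixed signs, taking the scale as the mean absolute value of the row minimizes the squared reconstruction error. The function names (`binarize_row`, `dequantize_row`, `squared_error`) and the toy weights are hypothetical and chosen for illustration. None of MoBiE's three components (joint SVD, gradient-aware importance, null-space constraint) is modeled here.

```python
# Generic sign-and-scale binarization, one scale per weight row.
# Assumption: textbook scheme for illustration only, NOT the MoBiE method.

def binarize_row(row):
    """Return (alpha, signs) such that row is approximated by alpha * signs."""
    # For fixed signs, alpha = mean(|w|) minimizes the squared error.
    alpha = sum(abs(w) for w in row) / len(row)
    signs = [1 if w >= 0 else -1 for w in row]
    return alpha, signs


def dequantize_row(alpha, signs):
    """Rebuild the approximate full-precision row from its binary form."""
    return [alpha * s for s in signs]


def squared_error(row, approx):
    """Sum of squared differences between a row and its approximation."""
    return sum((w - a) ** 2 for w, a in zip(row, approx))


if __name__ == "__main__":
    # Toy weight matrix (hypothetical values).
    weights = [[0.5, -1.5, 1.0, -1.0], [2.0, 2.0, -2.0, 2.0]]
    for row in weights:
        alpha, signs = binarize_row(row)
        approx = dequantize_row(alpha, signs)
        print(alpha, signs, squared_error(row, approx))
```

In this scheme each weight costs one bit plus a shared scale per row, which is where the memory saving of binarization comes from; the approximation error that remains is what methods such as the one summarized above try to reduce.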

Related papers