Fugu-MT Paper Translation (Summary): OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale

Paper summary: OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale

arxiv url: http://arxiv.org/abs/2602.05711v1
Date: Thu, 05 Feb 2026 14:37:32 GMT
Status: Translation complete
Last updated in system: 2026-02-06 18:49:08.977603
Title: OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale
Title (reference translation): OmniMoE: An Efficient MoE That Orchestrates Atomic Experts at Scale
Authors: Jingze Shi, Zhangyang Peng, Yizhang Zhu, Yifan Wu, Guang Liu, Yuyu Luo
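Abstract summary: We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces scalable routing and execution within a single MoE layer while retaining a shared dense branch for general-purpose processing. OmniMoE achieves 50.9% zero-shot accuracy across seven benchmarks, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines.
Reference score (independently computed attention score): 11.733927781098805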
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(sqrt(N)); and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. Validated on seven benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% zero-shot accuracy, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73 ms to 6.7 ms (a 10.9-fold speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.
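The abstract states only that the Cartesian Product Router decomposes the index space to cut routing cost from O(N) to O(sqrt(N)). The sketch below shows one way such a decomposition can work, under assumptions that are not taken from the paper: the N = n * n experts sit on an n x n grid, and the score of expert (i, j) is the sum of a row score and a column score. The function names (`cartesian_product_route`, `brute_force_route`) and the additive scoring rule are illustrative, not the authors' implementation.

```python
import heapq


def top_k(scores, k):
    """Return the k largest (score, index) pairs from a list of scores."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))


def cartesian_product_route(row_scores, col_scores, k):
    """Pick the top-k experts on an n x n grid without scoring all n * n.

    Assumption (hypothetical): expert (i, j) scores row_scores[i] + col_scores[j].
    With an additive score, any top-k expert pairs a top-k row with a top-k
    column, so only 2 * n sub-scores and k * k candidates are examined.
    """
    n = len(col_scores)
    rows = top_k(row_scores, k)
    cols = top_k(col_scores, k)
    candidates = [(rs + cs, ri * n + ci) for rs, ri in rows for cs, ci in cols]
    return heapq.nlargest(k, candidates)


def brute_force_route(row_scores, col_scores, k):
    """Reference router that scores every one of the n * n experts."""
    n = len(col_scores)
    full = [
        (rs + cs, ri * n + ci)
        for ri, rs in enumerate(row_scores)
        for ci, cs in enumerate(col_scores)
    ]
    return heapq.nlargest(k, full)


if __name__ == "__main__":
    # 16 experts addressed through two sets of 4 sub-scores.
    print(cartesian_product_route([1, 9, 4, 7], [3, 0, 8, 5], k=3))
```

When scores are tied, the factored and brute-force routers may return different but equally scored experts; the example input avoids ties so the two agree exactly.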
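Abstract (reference translation): To improve parameter efficiency, Mixture-of-Experts (MoE) architectures are evolving toward finer granularity. However, existing MoE designs face a trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. This atomic design maximizes capacity but poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(sqrt(N)); and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. Validated on seven benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% zero-shot accuracy, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Importantly, OmniMoE reduces inference latency from 73 ms to 6.7 ms (a 10.9-fold speedup) compared to PEER, showing that massive-scale fine-grained MoE can be fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.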
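The abstract describes Expert-Centric Scheduling only as inverting the execution order so that scattered lookups become dense operations. The following sketch illustrates that inversion in plain Python, under assumptions not found in the paper: each expert is modeled as a scalar function, and `assignments[t]` is a hypothetical list of `(expert_id, gate)` pairs for token `t`. The inner loop over a group stands in for what would be one batched matrix operation on real hardware.

```python
from collections import defaultdict


def token_centric(tokens, assignments, experts):
    """Baseline order: each token looks up its experts one at a time."""
    out = [0.0] * len(tokens)
    for t, x in enumerate(tokens):
        for e, gate in assignments[t]:
            out[t] += gate * experts[e](x)
    return out


def expert_centric(tokens, assignments, experts):
    """Inverted order: gather each expert's tokens first, then process per expert."""
    groups = defaultdict(list)  # expert id -> [(token index, gate), ...]
    for t, routed in enumerate(assignments):
        for e, gate in routed:
            groups[e].append((t, gate))

    out = [0.0] * len(tokens)
    for e in sorted(groups):
        fn = experts[e]
        # All tokens routed to expert e are handled together; in a real
        # kernel this group would be a single dense batched operation.
        for t, gate in groups[e]:
            out[t] += gate * fn(tokens[t])
    return out


if __name__ == "__main__":
    experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
    tokens = [1.0, 2.0, 3.0]
    assignments = [[(0, 0.5), (2, 0.5)], [(1, 1.0)], [(2, 0.25), (0, 0.75)]]
    print(expert_centric(tokens, assignments, experts))
```

Both orders compute the same weighted sums; only the traversal changes, which is what lets the per-expert work be laid out contiguously.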

Related papers