Fugu-MT Paper Translation (Abstract): BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

Paper overview: BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

arxiv url: http://arxiv.org/abs/2507.08771v2
Date: Wed, 30 Jul 2025 04:14:15 GMT
Status: Translation complete
Last updated in system: 2025-07-31 14:05:51.351758
Title: BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
Authors: Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun
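Abstract summary: We introduce BlockFFN, a novel MoE architecture, together with efficient training and deployment techniques for it. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), we design CLS-aware training objectives, making BlockFFN more acceleration-friendly.
Reference score (attention level, computed independently by the site): 66.94629945519125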
License: http://creativecommons.org/licenses/by/4.0/
Abstract: To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely activated architectures exhibit low chunk-level sparsity, meaning that the union of multiple consecutive tokens activates a large proportion of parameters. Such a sparsity pattern is unfriendly to acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67$\times$ speedup over dense models on real end-side devices. All code and checkpoints are publicly available (https://github.com/thunlp/BlockFFN).
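
The gap between token-level and chunk-level sparsity described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names, the 0/1 activation-mask representation, and the use of non-overlapping chunks are assumptions made only for illustration. TLS is taken here as the average fraction of experts a single token leaves inactive, and CLS as the average fraction left inactive by every token in a chunk of consecutive tokens.

```python
from typing import Sequence

# Hypothetical representation: mask[t][e] is 1 if token t activates expert e.
Mask = Sequence[Sequence[int]]


def token_level_sparsity(mask: Mask) -> float:
    """Average fraction of experts that a single token leaves inactive."""
    num_experts = len(mask[0])
    inactive = sum(num_experts - sum(row) for row in mask)
    return inactive / (len(mask) * num_experts)


def chunk_level_sparsity(mask: Mask, chunk_size: int) -> float:
    """Average fraction of experts left inactive by ALL tokens in a chunk.

    Assumption: chunks are non-overlapping windows of consecutive tokens;
    a trailing partial chunk is ignored.
    """
    num_experts = len(mask[0])
    ratios = []
    for start in range(0, len(mask) - chunk_size + 1, chunk_size):
        chunk = mask[start:start + chunk_size]
        # An expert counts as active if any token in the chunk activates it.
        union_active = sum(
            any(row[e] for row in chunk) for e in range(num_experts)
        )
        ratios.append(1 - union_active / num_experts)
    return sum(ratios) / len(ratios)


# Each token uses a different expert: sparse per token, dense per chunk.
scattered = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
# Consecutive tokens share one expert: sparse per token and per chunk.
shared = [[1, 0, 0, 0] for _ in range(4)]

print(token_level_sparsity(scattered))     # 0.75
print(chunk_level_sparsity(scattered, 4))  # 0.0
print(token_level_sparsity(shared))        # 0.75
print(chunk_level_sparsity(shared, 4))     # 0.75
```

Both toy masks have the same TLS, but only the second keeps a chunk of four tokens on a small set of experts. The abstract identifies the first pattern as the one that is unfriendly to acceleration.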

Related papers