Fugu-MT Paper Translation (Summary): Collaborative Compression for Large-Scale MoE Deployment on Edge

Paper summary: Collaborative Compression for Large-Scale MoE Deployment on Edge

arxiv url: http://arxiv.org/abs/2509.25689v1
Date: Tue, 30 Sep 2025 02:46:03 GMT
Status: translation complete
System update date: 2025-10-01 17:09:04.39806
Title: Collaborative Compression for Large-Scale MoE Deployment on Edge
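Title (reference translation): Collaborative Compression for Large-Scale MoE Deployment on Edge
Authors: Yixiao Chen, Yanyue Xie, Ruining Yang, Wei Jiang, Wei Wang, Yong He, Yue Chen, Pu Zhao, Yanzhi Wang
Abstract summary: This paper proposes a collaborative compression framework that combines expert pruning, mixed-precision quantization, and activation optimization. The authors report the first deployment of a model compressed from the ultra-large DeepSeek-V3 on a platform with a 128GB memory limit.
Reference score (independently computed attention score): 40.79738603826354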
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Mixture of Experts (MoE) architecture is an important method for scaling Large Language Models (LLMs). It increases model capacity while keeping computation cost low. However, the ultra-large MoE models still have hundreds of billions of parameters, requiring massive memory/storage and leading to difficulties for deployment on resource-constrained edge platforms. Pruning or quantization alone can hardly address the issue, because of the super-aggressive compression ratio with significantly degraded accuracy and output quality. To facilitate the deployment of ultra-large MoEs on edge platforms, we propose a collaborative compression framework by combining expert pruning, mixed-precision quantization, and activation optimization. It can effectively reduce the storage footprint of the ultra-large MoE DeepSeek-V3 from 1.3TB to 103GB, while preserving high output quality with better accuracy than traditional uniform low-bit quantization methods. To the best of our knowledge, we are the first to deploy a compressed model from the ultra-large DeepSeek-V3 on the platform with a strict 128GB total memory limit. Our comprehensive experiments on multiple benchmarks under various memory constraints demonstrate the effectiveness of our method with smaller model sizes and higher accuracy than uniform low-bit quantization methods.
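Abstract (reference translation): The Mixture of Experts (MoE) architecture is an important method for scaling Large Language Models (LLMs). It increases model capacity while keeping computation cost low. However, ultra-large MoE models still have hundreds of billions of parameters and require massive memory and storage, which makes deployment on resource-constrained edge platforms difficult. Pruning or quantization alone can hardly address this problem, because the extremely aggressive compression ratio that would be required significantly degrades accuracy and output quality. To ease the deployment of ultra-large MoEs on edge platforms, the paper proposes a collaborative compression framework that combines expert pruning, mixed-precision quantization, and activation optimization. The framework reduces the storage footprint of the ultra-large MoE DeepSeek-V3 from 1.3TB to 103GB while preserving high output quality, with better accuracy than traditional uniform low-bit quantization methods. To the best of the authors' knowledge, they are the first to deploy a model compressed from the ultra-large DeepSeek-V3 on a platform with a strict 128GB total memory limit. Comprehensive experiments on multiple benchmarks under various memory constraints demonstrate the effectiveness of the method, which yields smaller model sizes and higher accuracy than uniform low-bit quantization methods.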
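The size figures quoted in the abstract can be turned into a rough back-of-the-envelope check. The sketch below is not from the paper: it assumes decimal units (1 TB = 1000 GB), and it assumes a total of 671 billion parameters for DeepSeek-V3, a figure taken from the model's public release rather than from this abstract. The helper names are hypothetical.

```python
# Figures quoted in the abstract (decimal units assumed: 1 TB = 1000 GB).
ORIGINAL_GB = 1300.0     # "1.3TB" storage footprint of DeepSeek-V3
COMPRESSED_GB = 103.0    # "103GB" after collaborative compression
MEMORY_LIMIT_GB = 128.0  # "strict 128GB total memory limit"

# Assumption, not stated in the abstract: total parameter count of DeepSeek-V3.
ASSUMED_TOTAL_PARAMS = 671e9


def compression_ratio(original_gb: float, compressed_gb: float) -> float:
    """How many times smaller the compressed model is."""
    return original_gb / compressed_gb


def headroom_gb(limit_gb: float, model_gb: float) -> float:
    """Memory left under the limit after the weights are loaded."""
    return limit_gb - model_gb


def bits_per_original_param(size_gb: float, n_params: float) -> float:
    """Average storage bits per parameter of the uncompressed model."""
    return size_gb * 1e9 * 8 / n_params


if __name__ == "__main__":
    ratio = compression_ratio(ORIGINAL_GB, COMPRESSED_GB)
    spare = headroom_gb(MEMORY_LIMIT_GB, COMPRESSED_GB)
    bits = bits_per_original_param(COMPRESSED_GB, ASSUMED_TOTAL_PARAMS)
    print(f"compression ratio: {ratio:.1f}x")
    print(f"headroom under the limit: {spare:.0f} GB")
    print(f"average bits per original parameter: {bits:.2f}")
```

Under these assumptions the footprint shrinks by roughly 12.6x and leaves about 25 GB under the 128GB limit for activations, caches, and the rest of the system. The average of about 1.2 bits per original parameter is lower than the bit width actually used for the retained weights, because expert pruning removes parameters outright; this is consistent with the abstract's point that quantization alone would need an extremely aggressive bit width to reach the same size.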


Related papers