Fugu-MT Paper Translation (Abstract): HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

Paper summary: HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

arxiv url: http://arxiv.org/abs/2509.22299v1
Date: Fri, 26 Sep 2025 13:00:46 GMT
Status: Translation complete
System update date: 2025-09-29 20:57:54.437773
Title: HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space
Title (reference translation): HEAPr: Hessian-Based Efficient Pruning of Atomic Experts in Output Space
Authors: Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang,
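Abstract summary: HEAPr is a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts. It exploits the inherent properties of atomic experts to transform second-order information from expert parameters into that of atomic expert parameters. It outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks.
Reference score (independently computed attention score): 12.872890364287345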
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to Optimal Brain Surgeon (OBS) theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where $d$ is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including the DeepSeek MoE and Qwen MoE families, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at compression ratios of 20% to 25% in most models, while also reducing FLOPs by nearly 20%. The code can be found at https://github.com/LLIKKE/HEAPr.
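The decomposition of an expert into atomic experts can be illustrated with a minimal NumPy sketch. It assumes a gated feed-forward expert of the form W_down (silu(W_gate x) * (W_up x)), which the abstract does not spell out, and treats each intermediate unit j (row j of W_gate and W_up together with column j of W_down) as one atomic expert. The function names, shapes, and pruned indices are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def silu(z):
    """SiLU activation, z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def expert_forward(x, w_gate, w_up, w_down):
    """Full gated-FFN expert (assumed form): W_down @ (silu(W_gate x) * (W_up x))."""
    return w_down @ (silu(w_gate @ x) * (w_up @ x))

def atomic_outputs(x, w_gate, w_up, w_down):
    """Row j is the output contribution of (hypothetical) atomic expert j."""
    h = silu(w_gate @ x) * (w_up @ x)   # shape (d_ff,), one scalar per unit
    return h[:, None] * w_down.T        # shape (d_ff, d)

def pruned_forward(x, w_gate, w_up, w_down, keep):
    """Prune atomic experts by slicing the matching rows and columns."""
    return expert_forward(x, w_gate[keep], w_up[keep], w_down[:, keep])

rng = np.random.default_rng(0)
d, d_ff = 8, 16                          # toy sizes, for illustration only
x = rng.standard_normal(d)
w_gate = rng.standard_normal((d_ff, d))
w_up = rng.standard_normal((d_ff, d))
w_down = rng.standard_normal((d, d_ff))

full = expert_forward(x, w_gate, w_up, w_down)
parts = atomic_outputs(x, w_gate, w_up, w_down)

# Drop two arbitrarily chosen atomic experts (indices 3 and 7).
keep = np.array([j for j in range(d_ff) if j not in (3, 7)])
pruned = pruned_forward(x, w_gate, w_up, w_down, keep)

print(np.allclose(full, parts.sum(axis=0)))          # expert = sum of atomic outputs
print(np.allclose(pruned, parts[keep].sum(axis=0)))  # pruning removes only those terms
```

Because the expert output is an exact sum of per-unit contributions under this assumed form, removing a unit only drops its own term and leaves the remaining units untouched, which is what makes pruning at this finer granularity possible.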
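Abstract (reference translation): Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve better performance and lower inference cost than dense LLMs. However, their large parameter counts lead to prohibitive memory requirements and limit practical deployment. Existing pruning methods focus mainly on expert-level pruning, but this coarse granularity often causes substantial accuracy degradation. This work introduces HEAPr, a new pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, it uses second-order information based on principles similar to Optimal Brain Surgeon (OBS) theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$ to $O(d^2)$. HEAPr computes the importance of atomic experts with two forward passes and one backward pass. Extensive experiments on MoE models, including the DeepSeek MoE and Qwen MoE families, show that HEAPr outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks. Specifically, it achieves nearly lossless compression at compression ratios of 20% to 25% in most models, while reducing FLOPs by nearly 20%. The code is available at https://github.com/LLIKKE/HEAPr.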
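The gap between $O(d^4)$ and $O(d^2)$ storage can be made concrete with some back-of-the-envelope arithmetic. The sketch below rests on one reading of the abstract, not on anything it states explicitly: that the $O(d^4)$ figure corresponds to a dense Hessian over all parameters of one expert (three projection matrices of about d * d_ff entries each, with d_ff proportional to d), and that the $O(d^2)$ figure corresponds to a d x d Hessian over the expert's output. The sizes d and d_ff are hypothetical and not taken from any specific model.

```python
# Hypothetical sizes, chosen only to make the arithmetic concrete.
d = 2048       # hidden (model) dimension
d_ff = 1408    # expert intermediate dimension, assumed proportional to d

def hessian_entries(n_vars: int) -> int:
    """A dense Hessian over n_vars variables has n_vars * n_vars entries."""
    return n_vars * n_vars

# Assumed expert layout: gate, up, and down projections.
expert_params = 3 * d * d_ff

# Parameter-space Hessian: (3 * d * d_ff)^2 entries, i.e. O(d^4) when d_ff ~ d.
param_hessian = hessian_entries(expert_params)

# Output-space Hessian: d x d entries, i.e. O(d^2).
output_hessian = hessian_entries(d)

BYTES_FP32 = 4
param_tib = param_hessian * BYTES_FP32 / 2**40
output_mib = output_hessian * BYTES_FP32 / 2**20

print(f"parameter-space Hessian: {param_hessian:.3e} entries, {param_tib:.1f} TiB in fp32")
print(f"output-space Hessian:    {output_hessian:.3e} entries, {output_mib:.1f} MiB in fp32")
print(f"ratio: {param_hessian / output_hessian:.3e}")
```

Under these assumed sizes, the parameter-space matrix is far too large to store for even a single expert, while the output-space matrix fits comfortably in memory, which is consistent with the motivation given in the abstract.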

