Fugu-MT Paper Translation (Abstract): Spatio-Temporal Pruning for Compressed Spiking Large Language Models

Paper overview: Spatio-Temporal Pruning for Compressed Spiking Large Language Models

arxiv url: http://arxiv.org/abs/2508.20122v1
Date: Sat, 23 Aug 2025 22:21:47 GMT
Status: translation complete
System update date: 2025-08-29 18:12:01.562209
Title: Spatio-Temporal Pruning for Compressed Spiking Large Language Models
Title (reference translation): Spatio-Temporal Pruning for Compressed Spiking Large Language Models
Authors: Yi Jiang, Malyaban Bal, Brian Matejek, Susmit Jha, Adam Cobb, Abhronil Sengupta
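Abstract summary: Large Language Models (LLMs) present significant challenges for deployment in energy-constrained environments due to their large model sizes and high inference latency. We propose a novel spatio-temporal pruning framework for Spiking LLMs that optimizes computational efficiency while preserving high performance. Our approach offers a compelling solution for real-time, low-power natural language processing applications.
Reference score (independently computed attention score): 23.74945347657827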
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) present significant challenges for deployment in energy-constrained environments due to their large model sizes and high inference latency. Spiking Neural Networks (SNNs), inspired by the sparse event-driven neural processing and energy-efficient information transmission in the brain, offer a promising alternative for achieving low-power computing. Integrating the event-driven efficiency of spiking neurons with the advanced capabilities of LLMs represents a promising direction for power-efficient LLMs. This work specifically delves into the design of compressed spiking LLMs. Here, we revisit spatial and temporal pruning from the perspective of SNNs and propose a novel spatio-temporal pruning framework for Spiking LLMs to optimize computational efficiency while preserving high performance. Our spatial pruning technique reduces the number of active neurons and attention heads, effectively lowering the computational complexity of the model. Meanwhile, temporal pruning minimizes inference latency by dynamically adjusting the number of timesteps required for different layers. By combining these approaches with other compression techniques, we present the first work in the domain of Spiking LLMs to jointly explore spatial pruning, temporal pruning, extreme quantization and knowledge distillation strategies. Extensive experimental evaluation of our proposed framework for SpikingBERT on the large-scale GLUE benchmark demonstrates the efficacy of our approach in terms of computational operations and inference latency. Our approach offers a compelling solution for real-time, low-power natural language processing applications, making Spiking LLMs more practical for deployment on edge devices and in power-constrained settings.
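Abstract (reference translation): Large language models (LLMs) pose major challenges for deployment in energy-constrained environments because of their large model sizes and high inference latency. Spiking Neural Networks (SNNs), inspired by sparse event-driven neural processing and energy-efficient information transmission in the brain, offer a promising alternative for low-power computing. Integrating the event-driven efficiency of spiking neurons with the advanced capabilities of LLMs is a promising direction for power-efficient LLMs. This work focuses on the design of compressed spiking LLMs. We revisit spatial and temporal pruning from the perspective of SNNs and propose a novel spatio-temporal pruning framework for Spiking LLMs that optimizes computational efficiency while preserving high performance. Our spatial pruning technique reduces the number of active neurons and attention heads, effectively lowering the computational complexity of the model. Temporal pruning, meanwhile, minimizes inference latency by dynamically adjusting the number of timesteps required for different layers. By combining these approaches with other compression techniques, we present the first work in the domain of Spiking LLMs to jointly explore spatial pruning, temporal pruning, extreme quantization, and knowledge distillation strategies. Extensive experiments with the proposed framework for SpikingBERT on the large-scale GLUE benchmark demonstrate the efficacy of the approach in terms of computational operations and inference latency. Our approach offers a compelling solution for real-time, low-power natural language processing applications, making Spiking LLMs more practical for deployment on edge devices and in power-constrained settings.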
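To make the two pruning axes in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a simple leaky integrate-and-fire (LIF) neuron with a hard reset, a score-based keep mask standing in for spatial pruning of neurons or attention heads, and a per-layer timestep budget standing in for temporal pruning. All function names, the importance scores, and the cost model (kept units times timesteps) are hypothetical and chosen only for illustration.

```python
def lif_spike_counts(currents, timesteps, threshold=1.0, leak=0.9):
    """Simulate LIF neurons driven by constant input currents.

    Returns the number of spikes each neuron emits in `timesteps` steps.
    Assumption: hard reset to 0 after a spike (one common convention).
    """
    counts = []
    for current in currents:
        membrane, spikes = 0.0, 0
        for _ in range(timesteps):
            membrane = leak * membrane + current  # leaky integration
            if membrane >= threshold:
                spikes += 1
                membrane = 0.0  # hard reset after firing
        counts.append(spikes)
    return counts


def prune_mask(scores, keep_ratio):
    """Spatial pruning stand-in: keep the top `keep_ratio` units by score."""
    keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [i in kept for i in range(len(scores))]


def layer_cost(units_kept, timesteps):
    """Toy cost model: every kept unit is evaluated once per timestep."""
    return units_kept * timesteps


# Hypothetical two-layer model with 4 units (neurons or heads) per layer.
scores = [0.9, 0.1, 0.5, 0.3]           # made-up importance scores
mask = prune_mask(scores, keep_ratio=0.5)

dense_timesteps = [4, 4]                 # same budget for every layer
pruned_timesteps = [4, 2]                # temporal pruning: fewer steps in layer 2

dense_cost = sum(layer_cost(len(scores), t) for t in dense_timesteps)
pruned_cost = sum(layer_cost(sum(mask), t) for t in pruned_timesteps)

print("kept units:", mask)
print("dense cost:", dense_cost, "pruned cost:", pruned_cost)
print("spikes at T=4:", lif_spike_counts([0.6, 0.2], 4))
print("spikes at T=2:", lif_spike_counts([0.6, 0.2], 2))
```

In this toy model the two reductions multiply: removing units shrinks the work done at every timestep, and shortening a layer's timestep budget shrinks how many times the remaining units run. The paper's actual pruning criteria and per-layer timestep selection are not described in the abstract and are not reproduced here.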

Related papers