Fugu-MT Paper Translation (Summary): SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

Paper summary: SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

arxiv url: http://arxiv.org/abs/2604.18610v1
Date: Mon, 13 Apr 2026 15:32:44 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.35245
Title: SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression
Title (reference translation): SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression
Authors: Han Xu, Zhiyong Qin, Di Shang, Jiahong Zhang, Xuerui Qiu, Bo Lei, Tiejun Huang, Bo Xu, Guoqi Li
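Abstract summary: Spiking Neural Networks (SNNs) offer inherent energy efficiency advantages on neuromorphic hardware. The authors propose SpikeMLLM, the first spike-based framework for MLLMs. The results show that SpikeMLLM maintains near-lossless performance under aggressive timestep compression.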
Reference score (independently computed attention score): 46.709828328948724
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress but incur substantial computational overhead and energy consumption during inference, limiting deployment in resource-constrained environments. Spiking Neural Networks (SNNs), with their sparse event-driven computation, offer inherent energy efficiency advantages on neuromorphic hardware, yet extending them to MLLMs faces two key challenges: heterogeneous modalities make uniform spike encoding insufficient, and high-resolution image inputs amplify timestep unfolding overhead. We propose SpikeMLLM, the first spike-based framework for MLLMs, which unifies existing ANN quantization methods in the spiking representation space and incorporates Modality-Specific Temporal Scales (MSTS) guided by Modality Evolution Discrepancy (MED) and Temporally Compressed LIF (TC-LIF) for timestep compression from T=L-1 to T=log2(L)-1. Experiments on four representative MLLMs across diverse multimodal benchmarks show that SpikeMLLM maintains near-lossless performance under aggressive timestep compression (Tv/Tt=3/4), with average gaps of only 0.72% and 1.19% relative to the FP16 baseline on InternVL2-8B and Qwen2VL-72B. We further develop a dedicated RTL accelerator tailored to the spike-driven datapath, observing 9.06x higher throughput and 25.8x better power efficiency relative to an FP16 GPU baseline under a deployment-oriented co-design setting, suggesting the promise of algorithm-hardware co-design for efficient multimodal intelligence.
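Abstract (reference translation): Multimodal Large Language Models (MLLMs) have made great progress, but they incur substantial computational overhead and energy consumption during inference, which limits deployment in resource-constrained environments. Spiking Neural Networks (SNNs), with their sparse event-driven computation, offer inherent energy efficiency advantages on neuromorphic hardware, but extending them to MLLMs faces two major challenges: heterogeneous modalities make uniform spike encoding insufficient, and high-resolution image inputs amplify timestep unfolding overhead. We propose SpikeMLLM, the first spike-based framework for MLLMs, which unifies existing ANN quantization methods in the spiking representation space and incorporates Modality-Specific Temporal Scales (MSTS), guided by Modality Evolution Discrepancy (MED), together with Temporally Compressed LIF (TC-LIF) for timestep compression from T=L-1 to T=log2(L)-1. Experiments on four representative MLLMs across diverse multimodal benchmarks show that SpikeMLLM maintains near-lossless performance under aggressive timestep compression (Tv/Tt=3/4), with average gaps of only 0.72% and 1.19% relative to the FP16 baseline on InternVL2-8B and Qwen2VL-72B. We further develop a dedicated RTL accelerator tailored to the spike-driven datapath and observe 9.06x higher throughput and 25.8x better power efficiency relative to an FP16 GPU baseline under a deployment-oriented co-design setting, suggesting the promise of algorithm-hardware co-design for efficient multimodal intelligence.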
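
To make the size of the timestep compression concrete, the two formulas quoted in the abstract, T=L-1 and T=log2(L)-1, can be evaluated side by side. This is only an arithmetic sketch, not the authors' implementation. The abstract does not define L; the code assumes it is a power-of-two count of discrete levels, and the function names and the sample values L=16 and L=32 are hypothetical, chosen only because they produce the timestep counts 3 and 4 that appear in the Tv/Tt=3/4 setting.

```python
def uncompressed_timesteps(levels: int) -> int:
    """T = L - 1, the uncompressed timestep count quoted in the abstract."""
    return levels - 1


def compressed_timesteps(levels: int) -> int:
    """T = log2(L) - 1, the compressed timestep count quoted in the abstract.

    Assumption: L is a power of two, so the result is an integer.
    """
    if levels < 2 or levels & (levels - 1):
        raise ValueError("levels must be a power of two and at least 2")
    # For a power of two, bit_length() equals log2(L) + 1.
    return levels.bit_length() - 2


# Illustrative values only; the abstract does not say which L was used.
for levels in (16, 32):
    before = uncompressed_timesteps(levels)
    after = compressed_timesteps(levels)
    print(f"L={levels}: T={before} uncompressed, T={after} compressed")
```

Under these assumptions, L=16 shrinks from 15 timesteps to 3 and L=32 from 31 to 4, so the saving grows quickly as L increases.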


Related papers