Fugu-MT Paper Translation (Abstract): Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs

Paper overview: Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs

arxiv url: http://arxiv.org/abs/2606.03026v1
Date: Tue, 02 Jun 2026 02:03:37 GMT
Title: Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs
Authors: Ting Liu
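Abstract summary: Spiking language models expose activation sparsity that dense Transformer runtimes do not directly exploit. The authors implement a C++ CPU inference runtime that treats sparse binary spike states as an execution primitive. Spike-aware execution can improve CPU throughput and memory behavior for sparse spiking language models.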
Reference score (attention measure computed by the site): 8.419155861590548
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Spiking language models expose activation sparsity that dense Transformer runtimes do not directly exploit. This paper studies that property from a systems perspective. Building on the SymbolicLight V1 spike-gated language model family, we implement a C++ CPU inference runtime that treats sparse binary spike states as an execution primitive rather than only applying post-hoc weight compression. The runtime combines a manifest-driven weight loader, mixed row/column memory layout, AVX2/FMA kernels, per-channel symmetric INT8 quantization, and integer-domain accumulation for spike-conditioned sparse paths. On an AMD Ryzen 7 5800X, an early scalar FP32 baseline decodes at 9.5 tokens/s. Mixed-layout AVX2 FP32 raises this to 14.7 tokens/s, and AVX2 INT8 reaches 19.9 tokens/s on the same step-30k export while reducing the weight footprint from 3.49 GB to 1.06 GB. For the available 186k-step 874M-parameter INT8 export, the C++ runtime decodes at 22.63 tokens/s in a single-thread CPU benchmark, compared with 16.31 tokens/s for TinyLlama-1.1B Q8_0, 11.26 tokens/s for Falcon3-1B Q8_0, and 9.70 tokens/s for Qwen2.5-1.5B Q8_0 under llama.cpp. Thread scaling reaches 47.90 tokens/s at four CPU threads, and 512-token prefill improves from 29.86 to 94.68 tokens/s from one to eight threads. The throughput result comes with a quality cost: the SNN reports WikiText-2 perplexity 24.80, worse than the dense baselines in the same benchmark. We frame the result as an inference-systems study for sparse language runtimes, with longer-term motivation in embodied and edge agents that may benefit from local, low-core inference near sensors and actuators. Spike-aware execution can improve CPU throughput and memory behavior for sparse spiking language models, while model quality, controlled dense training baselines, embodied-task evaluation, and measured CPU energy remain open problems.
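The abstract names per-channel symmetric INT8 quantization and integer-domain accumulation for spike-conditioned sparse paths. The following is a minimal scalar sketch of how those two ideas can fit together. It is not the paper's runtime: the names `QuantizedMatrix`, `quantize_per_channel`, and `spike_matvec` are hypothetical, the column-major storage is an assumption about what a column layout could be used for, and no AVX2 intrinsics are used.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (not from the paper). Weights are stored column-major
// so that each active input spike reads one contiguous run of int8 values.
struct QuantizedMatrix {
    std::size_t rows = 0;         // output channels
    std::size_t cols = 0;         // input neurons
    std::vector<std::int8_t> q;   // rows * cols entries, column-major
    std::vector<float> scale;     // one symmetric scale per output channel
};

// Per-channel symmetric quantization: scale[r] = max|w[r][*]| / 127 and the
// zero point is 0, so w[r][c] is approximated by scale[r] * q[r][c].
QuantizedMatrix quantize_per_channel(const std::vector<float>& w_row_major,
                                     std::size_t rows, std::size_t cols) {
    QuantizedMatrix m;
    m.rows = rows;
    m.cols = cols;
    m.q.assign(rows * cols, 0);
    m.scale.assign(rows, 1.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float max_abs = 0.0f;
        for (std::size_t c = 0; c < cols; ++c)
            max_abs = std::max(max_abs, std::fabs(w_row_major[r * cols + c]));
        const float s = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
        m.scale[r] = s;
        for (std::size_t c = 0; c < cols; ++c) {
            const long v = std::lround(w_row_major[r * cols + c] / s);
            m.q[c * rows + r] =
                static_cast<std::int8_t>(std::clamp(v, -127L, 127L));
        }
    }
    return m;
}

// Spike-gated matrix-vector product. The input is binary, so each output is
// scale[r] times the sum of q[r][c] over the active columns c. The inner loop
// is integer addition only; floating point appears once, in the final rescale.
std::vector<float> spike_matvec(const QuantizedMatrix& m,
                                const std::vector<std::uint32_t>& active_cols) {
    std::vector<std::int32_t> acc(m.rows, 0);
    for (std::uint32_t c : active_cols) {
        assert(c < m.cols);
        const std::int8_t* col = m.q.data() + static_cast<std::size_t>(c) * m.rows;
        for (std::size_t r = 0; r < m.rows; ++r) acc[r] += col[r];
    }
    std::vector<float> y(m.rows);
    for (std::size_t r = 0; r < m.rows; ++r)
        y[r] = m.scale[r] * static_cast<float>(acc[r]);
    return y;
}
```

In this sketch the work scales with the number of active spikes rather than with the full input width, which is the property a dense runtime would not exploit. Because the scale is per output channel and the inputs are 0 or 1, the rescale can be deferred until after the integer sum.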
