Fugu-MT Paper Translation (Abstract): SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

Paper overview: SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

arxiv url: http://arxiv.org/abs/2604.22575v1
Date: Fri, 24 Apr 2026 14:07:54 GMT
Status: Translation complete
System update date: 2026-04-27 15:36:26.490595
Title: SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
Authors: Yuqi Pan, Jinghao Zhuang, Yupeng Feng, Fangzhi Zhong, Siyu Ding, Xuerui Qiu, Shaowei Gu, Bohan Sun, Zhiyong Qin, Yibo Zhong, Lingtao Ouyang, Kun Yang, Zehao Liu, Yuhong Chou, Shurong Wang, Anjie Hu, Han Xu, Bo Xu, Guoqi Li
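Abstract summary: The key challenge is to design foundation models that maintain performance and long-context efficiency with minimal training overhead. SpikingBrain2.0 (SpB2.0) is a 5B model that advances both the architecture and the training efficiency of its predecessor.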
Reference score (independently computed attention metric): 28.709623208731028
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling context length is reshaping large-model development, yet full-attention Transformers suffer from prohibitive computation and inference bottlenecks at long sequences. A key challenge is to design foundation models that maintain performance and long-context efficiency with minimal training overhead. We introduce SpikingBrain2.0 (SpB2.0), a 5B model that advances both architecture and training efficiency of its predecessor. Our contributions are two-fold. (1) Architectural Innovation: We propose Dual-Space Sparse Attention (DSSA), an inter-layer hybrid of Sparse Softmax Attention (MoBA) and Sparse Linear Attention (SSE), achieving an improved performance-efficiency trade-off for long-context modeling. SpB2.0 further supports dual quantization paths: INT8-Spiking coding enables sparse event-driven computation, while FP8 coding accelerates inference on modern GPUs. (2) Enhanced Training Strategy: We develop an optimized Transformer-to-Hybrid (T2H) pipeline with dual conversion paths for LLMs and VLMs using curated open-source data. Empirically, SpB2.0-5B and SpB2.0-VL-5B recover most of the base Transformer (Qwen3-4B) capability with under 7k A100 GPU hours. SpB2.0 achieves a 10.13x TTFT speedup at 4M context and supports over 10M tokens on 8 A100 GPUs under vLLM, where full-attention models exceed memory limits. It also demonstrates strong cross-platform compatibility, enabling FP8 GPU inference (2.52x speedup at 250k) and efficient neuromorphic execution (64.31% sparsity, with 70.6% and 46.5% area and power reduction at 500MHz). Overall, SpikingBrain2.0 provides a practical pathway for lightweight, multimodal, spiking foundation models, highlighting the potential of combining brain-inspired mechanisms with efficient architectures for resource-constrained and edge scenarios.
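Illustrative sketch: The abstract names Sparse Softmax Attention (MoBA) as one half of DSSA but does not describe its internals. The following is a minimal pure-Python sketch of block-sparse softmax attention in that general spirit, not the paper's implementation. Scoring each block by the query's dot product with the block's mean key, the top-k selection rule, and the names `block_sparse_attention`, `block_size`, and `top_k` are all assumptions made for illustration. Causal masking and multiple heads are omitted.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend from one query vector to only the top_k highest-scoring key blocks.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n, dim = len(keys), len(q)
    # Split key positions into contiguous blocks.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    # Score each block by the query's dot product with the block's mean key (assumption).
    scores = []
    for blk in blocks:
        mean_key = [sum(keys[i][d] for i in blk) / len(blk) for d in range(dim)]
        scores.append(dot(q, mean_key))
    # Keep the top_k highest-scoring blocks and gather their positions.
    chosen = sorted(range(len(blocks)), key=lambda b: scores[b], reverse=True)[:top_k]
    idx = sorted(i for b in chosen for i in blocks[b])
    # Scaled softmax attention restricted to the selected positions.
    weights = softmax([dot(q, keys[i]) / math.sqrt(dim) for i in idx])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx))
           for d in range(len(values[0]))]
    return out, idx

q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [1.0], [0.0], [0.0], [5.0], [5.0]]
out, idx = block_sparse_attention(q, keys, values, block_size=2, top_k=1)
print(idx, out)
```

With `top_k` equal to the number of blocks, the function reduces to ordinary full softmax attention over all positions; smaller `top_k` values cut the per-query cost roughly in proportion to the fraction of blocks kept.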
