SpikingBrain: Spiking Brain-inspired Large Models
- URL: http://arxiv.org/abs/2509.05276v2
- Date: Sun, 19 Oct 2025 11:50:16 GMT
- Title: SpikingBrain: Spiking Brain-inspired Large Models
- Authors: Yuqi Pan, Yupeng Feng, Jinghao Zhuang, Siyu Ding, Han Xu, Zehao Liu, Bohan Sun, Yuhong Chou, Xuerui Qiu, Anlin Deng, Anjie Hu, Peng Zhou, Man Yao, Jibin Wu, Jian Yang, Guoliang Sun, Bo Xu, Guoqi Li
- Abstract summary: SpikingBrain is a family of brain-inspired models for efficient long-context training and inference. We develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. Our models significantly improve long-sequence training efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior.
- Score: 42.41339012839023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models significantly improve long-sequence training efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Training remains stable for weeks on hundreds of MetaX C550 GPUs, with the 7B model reaching a Model FLOPs Utilization of 23.4 percent. The proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
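The two efficiency mechanisms the abstract describes can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual architecture: `linear_attention_recurrent` shows why linear attention admits constant-memory inference (a fixed-size state is updated per token, instead of a KV cache that grows with sequence length), and `spike_encode` shows threshold-based event-driven sparsity of the kind the spike coding framework exploits. The ReLU feature map, the threshold value, and all function names are illustrative assumptions.

```python
import numpy as np

def linear_attention_recurrent(q, k, v):
    """Causal linear attention computed as a recurrence.

    q, k, v: arrays of shape (T, d). The state S (d x d) and normalizer z (d)
    are fixed-size regardless of T, unlike softmax attention, whose KV cache
    grows linearly with sequence length.
    """
    T, d = q.shape
    S = np.zeros((d, d))
    z = np.zeros(d)
    out = np.zeros_like(v, dtype=float)
    for t in range(T):
        # Simple positive feature map (an assumption; real models vary).
        phi_q = np.maximum(q[t], 0.0) + 1e-6
        phi_k = np.maximum(k[t], 0.0) + 1e-6
        S += np.outer(phi_k, v[t])   # accumulate key-value outer products
        z += phi_k                   # accumulate normalizer
        out[t] = (phi_q @ S) / (phi_q @ z)
    return out

def spike_encode(x, threshold=0.5):
    """Threshold activations into ternary spikes {-1, 0, +1}.

    Sub-threshold values emit no spike, so event-driven hardware can skip
    them entirely; returns the spikes and the fraction of zeros (sparsity).
    """
    spikes = (np.abs(x) >= threshold).astype(float) * np.sign(x)
    sparsity = 1.0 - np.count_nonzero(spikes) / x.size
    return spikes, sparsity
```

The key design point the sketch makes concrete: the per-token update cost and the state size in `linear_attention_recurrent` do not depend on how many tokens came before, which is what enables the (partially) constant-memory inference and long-sequence speedups claimed in the abstract.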
Related papers
- MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at a sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z)
- Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts [27.8245634187787]
We present HALO, a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
arXiv Detail & Related papers (2026-01-29T18:59:53Z)
- EA4LLM: A Gradient-Free Approach to Large Language Model Optimization via Evolutionary Algorithms [23.009274904878065]
We propose EA4LLM, an evolutionary algorithm for optimizing large language models (LLMs). We empirically verify full-parameter optimization from the pretraining stage across model sizes ranging from 0.5B to 32B. Our work challenges the prevailing assumption that gradient-based optimization is the only viable approach for training neural networks.
arXiv Detail & Related papers (2025-10-12T13:38:28Z)
- Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models [49.911784762244814]
TraceRL is a trajectory-aware reinforcement learning framework for diffusion language models (DLMs). We derive a series of state-of-the-art diffusion language models, namely TraDo. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-09-08T17:58:06Z)
- LongCat-Flash Technical Report [165.66422346171862]
LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model. It is designed for both computational efficiency and advanced agentic capabilities. We complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens.
arXiv Detail & Related papers (2025-09-01T10:05:45Z)
- Reparameterized LLM Training via Orthogonal Equivalence Transformation [54.80172809738605]
We present POET, a novel training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. POET can stably optimize the objective function with improved generalization. We develop efficient approximations that make POET flexible and scalable for training large-scale neural networks.
arXiv Detail & Related papers (2025-06-09T17:59:34Z)
- MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core [11.40633051522406]
We introduce an end-to-end training framework for large-scale MoE models. MoE Parallel Folding is a novel strategy that decouples the parallelization of attention and MoE in Transformer models. A flexible token-level dispatcher supports both token-dropping and token-dropless MoE training.
arXiv Detail & Related papers (2025-04-21T08:39:47Z)
- Puzzle: Distillation-Based NAS for Inference-Optimized LLMs [17.72841008597783]
Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. We showcase our framework's impact via Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B) and Llama-3.3-Nemotron-49B, two publicly available models.
arXiv Detail & Related papers (2024-11-28T13:45:42Z)
- Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training [2.875838666718042]
We focus on parallel and distributed machine learning algorithm development, specifically for optimizing the data processing and pre-training of a set of 5 encoder-decoder LLMs.
We performed a fine-grained study to quantify the relationships between three ML methods, specifically exploring Microsoft DeepSpeed ZeRO (Zero Redundancy Optimizer) stages.
arXiv Detail & Related papers (2023-10-09T02:22:00Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.