Large Language Models Inference Engines based on Spiking Neural Networks
- URL: http://arxiv.org/abs/2510.00133v3
- Date: Tue, 14 Oct 2025 22:02:02 GMT
- Title: Large Language Models Inference Engines based on Spiking Neural Networks
- Authors: Adarsha Balaji, Sandeep Madireddy, Prasanna Balaprakash,
- Abstract summary: We explore spiking neural networks (SNNs) to design transformer models. Training large-scale SNNs with existing surrogate learning methods is inefficient and time-consuming. We propose NeurTransformer, a methodology for designing transformer-based SNNs for inference.
- Score: 5.529385616266398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundational models based on the transformer architecture are currently the state-of-the-art in general language modeling, as well as in scientific areas such as material science and climate. However, training and deploying these models is computationally challenging because their time and space complexity grows quadratically with the input sequence length. Several efforts have explored efficient computational paradigms and model architectures to address these limitations. In this work, we explore spiking neural networks (SNNs) to design transformer models. Training large-scale SNNs with existing surrogate learning methods is inefficient and time-consuming. On the other hand, techniques to convert existing transformer-based models to their SNN equivalents are not scalable, as achieving optimal performance comes at the cost of a large number of spike time-steps, i.e. increased latency. To address this, we propose NeurTransformer, a methodology for designing transformer-based SNNs for inference that combines a supervised fine-tuning approach with existing conversion methods. The proposed methodology works by: (1) replacing the self-attention mechanism with a spike-based self-attention (SSA), (2) converting the feed-forward block of the trained transformer model to its equivalent SNN, and (3) fine-tuning the SSA block using SNN-based surrogate learning algorithms. We benchmark the proposed methodology and demonstrate its accuracy and scalability using three variants of the GPT-2 model of increasing model size. We observe that the converted GPT-2 small models demonstrate a 5-12% loss in cosine similarity and a 9.7% reduction in perplexity. Finally, we demonstrate the energy efficiency of the SSA block compared to the ASA block and show between 64.71% and 85.28% reductions in estimated energy consumption when implementing the self-attention mechanism on digital hardware.
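The latency trade-off that the abstract attributes to conversion methods can be illustrated with a minimal sketch. It assumes a rate-coded integrate-and-fire neuron with a soft (subtractive) reset driven by a constant input, which is a common choice in ANN-to-SNN conversion but not necessarily the neuron model used in the paper. The function name `if_neuron_rate` and its parameters are hypothetical and chosen only for illustration.

```python
def if_neuron_rate(activation, num_steps, threshold=1.0):
    """Approximate a ReLU activation by the firing rate of an
    integrate-and-fire neuron (illustrative assumption, not the paper's model).

    The neuron receives `activation` as a constant input current at every
    time-step, fires when its membrane potential reaches `threshold`, and
    then subtracts the threshold (soft reset).
    """
    membrane = 0.0
    spikes = 0
    for _ in range(num_steps):
        membrane += activation          # integrate the constant input
        if membrane >= threshold:       # fire and apply the soft reset
            membrane -= threshold
            spikes += 1
    # The scaled firing rate is the spiking estimate of the activation.
    return spikes * threshold / num_steps


if __name__ == "__main__":
    target = 0.37  # hypothetical ANN activation in the range [0, threshold]
    for num_steps in (4, 16, 64, 256):
        estimate = if_neuron_rate(target, num_steps)
        print(num_steps, estimate, abs(estimate - target))
```

For an input between zero and the threshold, the approximation error of this estimate is bounded by roughly `threshold / num_steps`, so tighter agreement with the original activation requires more time-steps, which means higher inference latency. Negative inputs never reach the threshold and yield a rate of zero, mirroring ReLU.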
Related papers
- NoiseFormer -- Noise Diffused Symmetric Attention Transformer [0.0]
We propose a novel unified model architecture called Noise Diffused Symmetric Attention Transformer to enhance the model's performance. The proposed model is validated upon GPT2 base model and the results reflect the performance gains falling between plain Symmetric attention and GPT2 base model.
arXiv Detail & Related papers (2026-01-10T14:10:48Z) - CSDformer: A Conversion Method for Fully Spike-Driven Transformer [11.852241487470797]
Spike-based transformer is a novel architecture aiming to enhance the performance of spiking neural networks. We propose CSDformer, a novel conversion method for fully spike-driven transformers. CSDformer achieves high performance under ultra-low latency, while dramatically reducing both computational complexity and training overhead.
arXiv Detail & Related papers (2025-09-22T07:55:03Z) - Learning Transformer-based World Models with Contrastive Predictive Coding [58.0159270859475]
We show that the next state prediction objective is insufficient to fully exploit the representation capabilities of Transformers. We propose to extend world model predictions to longer time horizons by introducing TWISTER, a world model using action-conditioned Contrastive Predictive Coding. TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ look-ahead search.
arXiv Detail & Related papers (2025-03-06T13:18:37Z) - BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNN). We propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations. Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z) - Towards High-performance Spiking Transformers from ANN to SNN Conversion [43.53538629484375]
Spiking neural networks (SNNs) show great potential due to their energy efficiency, fast processing capabilities, and robustness. Current conversion methods mainly focus on converting convolutional neural networks (CNNs) to SNNs. In this paper, we propose an Expectation Compensation Module to preserve accuracy of the conversion.
arXiv Detail & Related papers (2025-02-28T16:12:37Z) - Binary Event-Driven Spiking Transformer [36.815359983551986]
Transformer-based Spiking Neural Networks (SNNs) introduce a novel event-driven self-attention paradigm. We propose the Binary Event-Driven Spiking Transformer, i.e. BESTformer. BESTformer suffers from a severe performance drop from its full-precision counterpart due to the limited representation capability of binarization.
arXiv Detail & Related papers (2025-01-10T12:00:11Z) - Adaptive Calibration: A Unified Conversion Framework of Spiking Neural Network [1.5215973379400674]
Spiking Neural Networks (SNNs) are seen as an energy-efficient alternative to traditional Artificial Neural Networks (ANNs). We present a unified training-free conversion framework that significantly enhances both the performance and efficiency of converted SNNs.
arXiv Detail & Related papers (2024-12-18T09:38:54Z) - Deep-Unrolling Multidimensional Harmonic Retrieval Algorithms on Neuromorphic Hardware [78.17783007774295]
This paper explores the potential of conversion-based neuromorphic algorithms for highly accurate and energy-efficient single-snapshot multidimensional harmonic retrieval. A novel method for converting the complex-valued convolutional layers and activations into spiking neural networks (SNNs) is developed. The converted SNNs achieve almost five-fold power efficiency at moderate performance loss compared to the original CNNs.
arXiv Detail & Related papers (2024-12-05T09:41:33Z) - Accelerating Toeplitz Neural Network with Constant-time Inference
Complexity [21.88774274472737]
Toeplitz Neural Networks (TNNs) have exhibited outstanding performance in various sequence modeling tasks.
They outperform commonly used Transformer-based models while benefiting from log-linear space-time complexities.
In this paper, we aim to combine the strengths of TNNs and State Space Models (SSMs) by converting TNNs to SSMs during inference.
arXiv Detail & Related papers (2023-11-15T07:50:57Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore to accelerate large-model inference by conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We prove that eliminating the MASK token and considering the whole output during the loss are essential choices to improve performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)