LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre
Memory Units
- URL: http://arxiv.org/abs/2402.04882v1
- Date: Sat, 20 Jan 2024 01:10:18 GMT
- Title: LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre
Memory Units
- Authors: Zeyu Liu, Gourav Datta, Anni Li, Peter Anthony Beerel
- Abstract summary: Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability.
We show how architectural modifications to a recurrent model can help push its performance toward Transformer models.
We present a spiking version of this architecture, which introduces the benefit of states within the patch embedding and channel mixer modules.
- Score: 5.830814457423021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models have demonstrated high accuracy in numerous applications
but have high complexity and lack sequential processing capability making them
ill-suited for many streaming applications at the edge where devices are
heavily resource-constrained. Thus motivated, many researchers have proposed
reformulating the transformer models as RNN modules which modify the
self-attention computation with explicit states. However, these approaches
often incur significant performance degradation. The ultimate goal is to
develop a model that has the following properties: parallel training, streaming
and low-cost inference, and SOTA performance. In this paper, we propose a new
direction to achieve this goal. We show how architectural modifications to a
recurrent model can help push its performance toward Transformer models while
retaining its sequential processing capability. Specifically, inspired by the
recent success of Legendre Memory Units (LMU) in sequence learning tasks, we
propose LMUFormer, which augments the LMU with convolutional patch embedding
and convolutional channel mixer. Moreover, we present a spiking version of this
architecture, which introduces the benefit of states within the patch embedding
and channel mixer modules while simultaneously reducing the computing
complexity. We evaluated our architectures on multiple sequence datasets. In
comparison to SOTA transformer-based models within the ANN domain on the SCv2
dataset, our LMUFormer demonstrates comparable performance while necessitating
a remarkable 53 times reduction in parameters and a substantial 65 times
decrement in FLOPs. Additionally, owing to our model's proficiency in real-time
data processing, we can achieve a 32.03% reduction in sequence length, all
while incurring an inconsequential decline in performance. Our code is publicly
available at https://github.com/zeyuliu1037/LMUFormer.git.
Related papers
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to complement the lack of long-range dependency issues.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even blocks (Mamba)
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z) - Harnessing Manycore Processors with Distributed Memory for Accelerated
Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Fourier Transformer: Fast Long Range Modeling by Removing Sequence
Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retain the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performances among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART.
arXiv Detail & Related papers (2023-05-24T12:33:06Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose A graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges.
Considering the redundancy in existing architectures, we first utilize the mode approximation to generate 0.1M trainable parameters to implement the multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z) - DCT-Former: Efficient Self-Attention with Discrete Cosine Transform [4.622165486890318]
An intrinsic limitation of the Trasformer architectures arises from the computation of the dot-product attention.
Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module.
An extensive section of experiments shows that our method takes up less memory for the same performance, while also drastically reducing inference time.
arXiv Detail & Related papers (2022-03-02T15:25:27Z) - Parallelizing Legendre Memory Unit Training [5.076419064097734]
A new recurrent neural network (RNN) named the Legendre Memory Unit (LMU) was proposed and shown to achieve state-of-the-art performance on several benchmark datasets.
Here we leverage the linear time-invariant (LTI) memory component of the LMU to construct a simplified variant that can be parallelized during training.
We show that this reformulation that aids parallelizing, which can be applied generally to any deep network whose recurrent components are linear, makes training up to 200 times faster.
arXiv Detail & Related papers (2021-02-22T23:43:47Z) - Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance-art in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.