FlashRNN: Optimizing Traditional RNNs on Modern Hardware
- URL: http://arxiv.org/abs/2412.07752v2
- Date: Mon, 13 Jan 2025 17:34:22 GMT
- Title: FlashRNN: Optimizing Traditional RNNs on Modern Hardware
- Authors: Korbinian Pöppel, Maximilian Beck, Sepp Hochreiter
- Abstract summary: State-tracking capabilities are important for time-series tasks and logical reasoning.
Traditional RNNs like LSTMs and GRUs do have these capabilities at the cost of strictly sequential processing.
We show how fast these networks can get with our hardware optimization FlashRNN, implemented in Triton and CUDA with kernels optimized to the register level on modern GPUs.
- Score: 6.749483762719583
- License:
- Abstract: While Transformers and other sequence-parallelizable neural network architectures seem like the current state of the art in sequence modeling, they specifically lack state-tracking capabilities. These are important for time-series tasks and logical reasoning. Traditional RNNs like LSTMs and GRUs, as well as modern variants like sLSTM do have these capabilities at the cost of strictly sequential processing. While this is often seen as a strong limitation, we show how fast these networks can get with our hardware-optimization FlashRNN in Triton and CUDA, optimizing kernels to the register level on modern GPUs. We extend traditional RNNs with a parallelization variant that processes multiple RNNs of smaller hidden state in parallel, similar to the head-wise processing in Transformers. To enable flexibility on different GPU variants, we introduce a new optimization framework for hardware-internal cache sizes, memory and compute handling. It models the hardware in a setting using polyhedral-like constraints, including the notion of divisibility. This speeds up the solution process in our ConstrINT library for general integer constraint satisfaction problems (integer CSPs). We show that our kernels can achieve 50x speed-ups over a vanilla PyTorch implementation and allow 40x larger hidden sizes compared to our Triton implementation. Our open-source kernels and the optimization library are released here to boost research in the direction of state-tracking enabled RNNs and sequence modeling: https://github.com/NX-AI/flashrnn
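Two of the techniques described in the abstract can be made concrete with short sketches. First, the head-wise parallelization variant: instead of one RNN with a single large hidden state, several smaller RNNs are processed in parallel, which amounts to a block-diagonal recurrent weight matrix. The sketch below expresses this with stock PyTorch modules; the class name HeadwiseLSTM and all sizes are illustrative, and it is a reference for the semantics only, not the fused FlashRNN kernel.
```python
# Minimal sketch (not the FlashRNN kernels): head-wise LSTM processing,
# where several small LSTMs replace one LSTM with a large hidden state.
import torch
import torch.nn as nn

class HeadwiseLSTM(nn.Module):
    """Runs num_heads independent LSTMs of size hidden_size // num_heads.

    The recurrent weight matrix is effectively block-diagonal, so each head
    updates only its own slice of the hidden state -- the parallelization
    variant the abstract compares to head-wise processing in Transformers.
    """
    def __init__(self, input_size: int, hidden_size: int, num_heads: int):
        super().__init__()
        assert hidden_size % num_heads == 0, "hidden_size must be divisible by num_heads"
        self.head_dim = hidden_size // num_heads
        # One small LSTM per head; a fused kernel would batch these, but a
        # ModuleList keeps the sketch readable.
        self.heads = nn.ModuleList(
            nn.LSTM(input_size, self.head_dim, batch_first=True)
            for _ in range(num_heads)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_size) -> (batch, seq_len, hidden_size)
        outputs = [head(x)[0] for head in self.heads]
        return torch.cat(outputs, dim=-1)

if __name__ == "__main__":
    model = HeadwiseLSTM(input_size=64, hidden_size=512, num_heads=8)
    y = model(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 512])
```
Second, the hardware-aware tuning is posed as an integer constraint satisfaction problem with divisibility constraints over cache, memory, and register sizes. The toy brute-force search below illustrates the flavor of such a problem for assumed capacities; the constants, constraints, and scoring rule are made up for illustration, and this is not the ConstrINT solver, which handles general integer CSPs with polyhedral-like constraints.
```python
# Toy sketch of a hardware-sizing integer CSP with divisibility constraints.
# All capacities and the objective are hypothetical; this is not ConstrINT.
from itertools import product

REGISTERS_PER_THREAD = 255       # assumed per-thread register budget
SHARED_MEM_BYTES = 48 * 1024     # assumed shared-memory capacity per block
BYTES_PER_ELEM = 4               # fp32

def search_tiling(hidden_size: int):
    """Brute-force the tile height and thread count that maximize tile size
    subject to capacity and divisibility constraints."""
    best = None
    for tile_h, threads in product(range(1, 257), [32, 64, 128, 256]):
        # Divisibility: tiles must evenly cover the hidden dimension.
        if hidden_size % tile_h != 0:
            continue
        # Capacity: one tile of the recurrent matrix must fit in shared memory.
        if tile_h * hidden_size * BYTES_PER_ELEM > SHARED_MEM_BYTES:
            continue
        # Registers: each thread holds its share of the tile's accumulator.
        regs_needed = (tile_h * tile_h + threads - 1) // threads
        if regs_needed > REGISTERS_PER_THREAD:
            continue
        if best is None or tile_h > best[0]:   # prefer larger tiles (more reuse)
            best = (tile_h, threads)
    return best

if __name__ == "__main__":
    print(search_tiling(768))  # e.g. (16, 32) under the assumed capacities
```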
Related papers
- Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm (a minimal sketch of this idea appears after the related-papers list).
We show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings.
arXiv Detail & Related papers (2024-05-22T19:45:01Z) - Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
Results show that TSO identifies the tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
arXiv Detail & Related papers (2022-01-16T07:22:47Z) - Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks [72.81092567651395]
Sub-bit Neural Networks (SNNs) are a new type of binary quantization design tailored to compress and accelerate BNNs.
SNNs are trained with a kernel-aware optimization framework, which exploits binary quantization in the fine-grained convolutional kernel space.
Experiments on visual recognition benchmarks and the hardware deployment on FPGA validate the great potential of SNNs.
arXiv Detail & Related papers (2021-10-18T11:30:29Z) - Fully Spiking Variational Autoencoder [66.58310094608002]
Spiking neural networks (SNNs) can be run on neuromorphic devices with ultra-high speed and ultra-low energy consumption.
In this study, we build a variational autoencoder (VAE) with SNN to enable image generation.
arXiv Detail & Related papers (2021-09-26T06:10:14Z) - Neural Architecture Search as Program Transformation Exploration [7.090165638014331]
Compilers apply program transformations in order to exploit hardware parallelism and memory hierarchy.
Neural architecture search (NAS) techniques mutate networks by operations such as the grouping or bottlenecking of convolutions.
In this work, we express such neural architecture operations as program transformations whose legality depends on a notion of representational capacity.
arXiv Detail & Related papers (2021-02-12T16:11:05Z) - PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z) - Compiling Spiking Neural Networks to Neuromorphic Hardware [4.273223677453178]
Spiking Neural Networks (SNNs) can lower the energy consumption of machine learning applications executed on neuromorphic hardware.
We propose an approach to analyze and compile SNNs on a resource-constrained neuromorphic hardware.
arXiv Detail & Related papers (2020-04-07T21:13:27Z) - TFApprox: Towards a Fast Emulation of DNN Approximate Hardware Accelerators on GPU [0.4817429789586127]
Energy efficiency of hardware accelerators of deep neural networks (DNN) can be improved by introducing approximate arithmetic circuits.
A software emulation of the DNN accelerator is usually executed on CPU or GPU.
This emulation is typically two or three orders of magnitude slower than a software DNN implementation running on a CPU or GPU.
arXiv Detail & Related papers (2020-02-21T08:22:56Z)
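As a concrete illustration of the prefix-scan view of attention referenced in the "Attention as an RNN" entry above, the sketch below computes causal softmax attention for every prefix of a sequence using cumulative sums rather than an explicit loop. It assumes a single fixed query vector (the many-to-one setting evaluated at every prefix); the function name and the global-max stabilization are illustrative choices, not the paper's algorithm, which relies on a numerically careful parallel log-sum-exp scan.
```python
# Minimal sketch: softmax attention over every prefix via cumulative sums.
# Assumes one fixed query vector; not the paper's parallel-scan implementation.
import torch

def attention_prefix_scan(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """For each position t, return softmax attention of query q over keys/values 0..t.

    q: (dim,)    k, v: (seq_len, dim)    returns: (seq_len, dim)
    """
    scores = k @ q                              # (seq_len,) similarity of q to each key
    w = torch.exp(scores - scores.max())        # global max keeps this sketch stable
    num = torch.cumsum(w[:, None] * v, dim=0)   # running weighted sum of values
    den = torch.cumsum(w, dim=0)[:, None]       # running softmax normalizer
    return num / den

if __name__ == "__main__":
    q = torch.randn(16)
    k, v = torch.randn(32, 16), torch.randn(32, 16)
    out = attention_prefix_scan(q, k, v)        # out[t] attends over positions 0..t
    print(out.shape)                            # torch.Size([32, 16])
```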
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.