Fast Inner-Product Algorithms and Architectures for Deep Neural Network
Accelerators
- URL: http://arxiv.org/abs/2311.12224v1
- Date: Mon, 20 Nov 2023 22:37:20 GMT
- Title: Fast Inner-Product Algorithms and Architectures for Deep Neural Network
Accelerators
- Authors: Trevor E. Pogue, Nicola Nicolici
- Abstract summary: We introduce a new algorithm called the Free-pipeline Fast Inner Product (FFIP) and its hardware architecture.
The underlying fast inner-product (FIP) algorithm is applicable to all machine learning (ML) model layers that can mainly decompose to matrix multiplication.
We show that FFIP can be seamlessly incorporated into traditional fixed-point systolic array ML accelerators.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new algorithm called the Free-pipeline Fast Inner Product
(FFIP) and its hardware architecture that improve an under-explored fast
inner-product algorithm (FIP) proposed by Winograd in 1968. Unlike the
unrelated Winograd minimal filtering algorithms for convolutional layers, FIP
is applicable to all machine learning (ML) model layers that can mainly
decompose to matrix multiplication, including fully-connected, convolutional,
recurrent, and attention/transformer layers. We implement FIP for the first
time in an ML accelerator, then present our FFIP algorithm and generalized
architecture, which inherently improve FIP's clock frequency and, as a
consequence, throughput for a similar hardware cost. Finally, we contribute
ML-specific optimizations for the FIP and FFIP algorithms and architectures. We
show that FFIP can be seamlessly incorporated into traditional fixed-point
systolic array ML accelerators to achieve the same throughput with half the
number of multiply-accumulate (MAC) units, or it can double the maximum
systolic array size that can fit onto devices with a fixed hardware budget. Our
FFIP implementation for non-sparse ML models with 8 to 16-bit fixed-point
inputs achieves higher throughput and compute efficiency than the best-in-class
prior solutions on the same type of compute platform.
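Because the abstract centers on Winograd's 1968 fast inner-product (FIP) identity, a minimal NumPy sketch of that classic formula may help; it shows only the arithmetic trick that lets roughly half of the multiplications be precomputed and shared across a matrix multiply, not the paper's FFIP architecture or fixed-point systolic-array pipeline, and the function name is illustrative.

```python
import numpy as np

def fip_inner_product(a, b):
    """Winograd's 1968 fast inner-product identity for even-length vectors:

        a . b = sum_j (a[2j] + b[2j+1]) * (a[2j+1] + b[2j])
                - sum_j a[2j] * a[2j+1] - sum_j b[2j] * b[2j+1]

    The last two sums each depend on only one operand, so in a matrix
    multiply they can be computed once per row/column and amortized,
    roughly halving the multiplications that involve both operands.
    """
    assert len(a) == len(b) and len(a) % 2 == 0
    a_even, a_odd = a[0::2], a[1::2]
    b_even, b_odd = b[0::2], b[1::2]
    cross = np.sum((a_even + b_odd) * (a_odd + b_even))
    return cross - np.sum(a_even * a_odd) - np.sum(b_even * b_odd)

# Sanity check against the ordinary inner product (integer inputs, exact match).
rng = np.random.default_rng(0)
x, y = rng.integers(-8, 8, 16), rng.integers(-8, 8, 16)
assert fip_inner_product(x, y) == np.dot(x, y)
```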
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA [10.630802853096462]
Modern Neural Network (NN) architectures heavily rely on vast numbers of multiply-accumulate arithmetic operations.
This paper proposes a high-throughput, scalable, and energy-efficient non-element-wise matrix multiplication unit on FPGAs.
The proposed AMU achieves up to 9x higher throughput and 112x higher energy efficiency than state-of-the-art FPGA-based Quantised Neural Network (QNN) accelerators.
arXiv Detail & Related papers (2024-07-02T15:28:10Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
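As a rough illustration of the additive-quantization idea that AQLM generalizes (not AQLM's actual group sizes, codebook counts, or training procedure), a weight group can be reconstructed as the sum of one codeword drawn from each of several codebooks; all names and shapes below are assumptions.

```python
import numpy as np

# Illustrative shapes only (assumptions): group size d, M codebooks of K codewords each.
d, M, K = 8, 2, 256

rng = np.random.default_rng(0)
codebooks = rng.standard_normal((M, K, d)).astype(np.float32)  # learned offline in AQ
codes = rng.integers(0, K, size=M)                             # one index per codebook

# Additive-quantization decode: the weight group is the *sum* of the selected
# codewords, one from each codebook.
w_hat = codebooks[np.arange(M), codes].sum(axis=0)
print(w_hat.shape)  # (8,) -- a reconstructed group of 8 weights
```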
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- Accelerating Machine Learning Primitives on Commodity Hardware [0.0]
We present an extensive study of the Sliding Window convolution technique as a more efficient alternative to the commonly used General Matrix Multiplication (GEMM) based convolution in Deep Neural Networks (DNNs).
Our results suggest that the Sliding Window computation kernels can outperform GEMM-based convolution on a CPU and even on dedicated hardware accelerators.
This could promote a wider adoption of AI on low-power and low-memory devices without the need for specialized hardware.
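A small sketch of the two computation routes this summary contrasts, with illustrative shapes and none of the paper's cache- or vectorization-level optimizations: a direct sliding-window kernel versus the usual im2col-plus-GEMM lowering.

```python
import numpy as np

def conv2d_sliding(x, w):
    """Direct sliding-window 2-D convolution (valid padding, stride 1)."""
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def conv2d_im2col(x, w):
    """The common GEMM route: lower patches into a matrix, then one matmul."""
    H, W = x.shape
    k = w.shape[0]
    cols = np.stack([x[i:i + k, j:j + k].ravel()
                     for i in range(H - k + 1)
                     for j in range(W - k + 1)])
    return (cols @ w.ravel()).reshape(H - k + 1, W - k + 1)

x = np.arange(36, dtype=np.float32).reshape(6, 6)
w = np.ones((3, 3), dtype=np.float32)
assert np.allclose(conv2d_sliding(x, w), conv2d_im2col(x, w))
```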
arXiv Detail & Related papers (2023-10-08T16:26:18Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
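For intuition only, the sketch below builds a generic butterfly-structured matrix (a product of log2(n) FFT-like sparse factors with two nonzeros per row); it shows the sparsity pattern's parameter savings, not this paper's unified attention/FFN approximation or its hardware mapping.

```python
import numpy as np

def butterfly_dense(n, rng):
    """Build a dense n x n matrix as a product of log2(n) butterfly factors.

    Each factor has FFT-like connectivity (row i mixes inputs i and i XOR 2^s)
    and only two nonzeros per row, so the full product is parameterized by
    2*n*log2(n) values instead of n**2.
    """
    assert n & (n - 1) == 0, "n must be a power of two"
    B = np.eye(n)
    for s in range(int(np.log2(n))):
        F = np.zeros((n, n))
        for i in range(n):
            j = i ^ (1 << s)                  # butterfly partner at stage s
            F[i, i], F[i, j] = rng.standard_normal(2)
        B = F @ B
    return B

B = butterfly_dense(16, np.random.default_rng(0))
print(B.shape)  # (16, 16), built from only 2 * 16 * 4 = 128 parameters
```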
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- A fully pipelined FPGA accelerator for scale invariant feature transform keypoint descriptor matching [0.0]
We design a novel fully pipelined hardware accelerator architecture for SIFT keypoint descriptor matching.
The proposed hardware architecture is able to properly handle the memory bandwidth necessary for a fully-pipelined implementation.
Our hardware implementation is 15.7 times faster than the comparable software approach.
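As a reference point for what such an accelerator pipelines, here is a plain software formulation of SIFT descriptor matching (brute-force L2 nearest neighbour with Lowe's ratio test); the threshold and shapes are illustrative, and the paper's memory-bandwidth handling is not modeled.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Brute-force SIFT descriptor matching with Lowe's ratio test.

    desc_a: (Na, 128), desc_b: (Nb, 128). Returns (i, j) pairs whose nearest
    neighbour is sufficiently closer than the second-nearest candidate.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # L2 distance to every candidate
        nn1, nn2 = np.argsort(dists)[:2]
        if dists[nn1] < ratio * dists[nn2]:
            matches.append((i, int(nn1)))
    return matches

rng = np.random.default_rng(0)
a, b = rng.random((50, 128)), rng.random((60, 128))
print(len(match_descriptors(a, b)))
```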
arXiv Detail & Related papers (2020-12-17T15:29:41Z)
- Iterative Algorithm Induced Deep-Unfolding Neural Networks: Precoding Design for Multiuser MIMO Systems [59.804810122136345]
We propose a framework for deep-unfolding, where a general form of iterative algorithm induced deep-unfolding neural network (IAIDNN) is developed.
An efficient IAIDNN based on the structure of the classic weighted minimum mean-square error (WMMSE) iterative algorithm is developed.
We show that the proposed IAIDNN efficiently achieves the performance of the iterative WMMSE algorithm with reduced computational complexity.
arXiv Detail & Related papers (2020-06-15T02:57:57Z)
- Minimal Filtering Algorithms for Convolutional Neural Networks [82.24592140096622]
We develop fully parallel hardware-oriented algorithms for implementing the basic filtering operation for M=3,5,7,9, and 11.
A fully parallel hardware implementation of the proposed algorithms in each case gives approximately 30 percent savings in the number of embedded multipliers.
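To illustrate the minimal-filtering family (the Winograd convolution algorithms that the FFIP abstract above explicitly distinguishes itself from), here is the classic F(2,3) transform, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6; the exact algorithms for the larger M values in this paper are not reproduced.

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd minimal filtering F(2,3): two outputs of a 3-tap FIR filter
    over four inputs using 4 multiplications instead of 6 (standard form).
    Illustrates the minimal-filtering idea only, not this paper's M=5..11
    hardware-oriented algorithms."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```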
arXiv Detail & Related papers (2020-04-12T13:18:25Z)
- SPEC2: SPECtral SParsE CNN Accelerator on FPGAs [31.31419913907224]
We propose SPEC2 -- the first work to prune and accelerate spectral CNNs.
We design an optimized pipeline architecture on FPGA that has efficient random access into sparse kernels.
The resulting accelerators achieve up to 24x higher throughput, compared with the state-of-the-art FPGA implementations for VGG16.
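For background on what "spectral" means here, the sketch below performs convolution by elementwise multiplication in the frequency domain, where SPEC2-style pruning would zero individual kernel coefficients; it is a generic illustration, not SPEC2's FPGA pipeline or sparse-access scheme.

```python
import numpy as np

def spectral_conv2d(x, k):
    """Circular convolution via the frequency domain: FFT, elementwise
    multiply, inverse FFT. Spectral CNNs keep kernels in this domain, where
    pruning zeroes individual frequency-domain coefficients."""
    H, W = x.shape
    Xf = np.fft.rfft2(x)
    Kf = np.fft.rfft2(k, s=(H, W))          # kernel spectrum (could be stored sparse)
    return np.fft.irfft2(Xf * Kf, s=(H, W))

x = np.random.default_rng(0).random((8, 8))
k = np.zeros((8, 8)); k[0, 0] = 1.0         # delta kernel as a sanity check
assert np.allclose(spectral_conv2d(x, k), x)
```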
arXiv Detail & Related papers (2019-10-16T23:30:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.