Low-Latency Online Multiplier with Reduced Activities and Minimized
Interconnect for Inner Product Arrays
- URL: http://arxiv.org/abs/2304.12946v1
- Date: Thu, 6 Apr 2023 01:22:27 GMT
- Title: Low-Latency Online Multiplier with Reduced Activities and Minimized
Interconnect for Inner Product Arrays
- Authors: Muhammad Usman, Milos Ercegovac, Jeong-A Lee
- Abstract summary: This paper proposes a low latency multiplier based on online or left-to-right arithmetic.
Online arithmetic enables overlapping successive operations regardless of data dependency.
The serial nature of the online algorithm and the gradual increment/decrement of active slices minimize interconnect and signal activity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiplication is indispensable and is one of the core operations in many
modern applications including signal processing and neural networks.
Conventional right-to-left (RL) multipliers contribute substantially to the
power consumption, area utilization, and critical-path delay in such applications.
This paper proposes a low latency multiplier based on online or left-to-right
(LR) arithmetic which can increase throughput and reduce latency by digit-level
pipelining. Online arithmetic enables overlapping successive operations
regardless of data dependency because of the most significant digit first mode
of operation. Producing the most significant digit first requires a redundant
number system, which permits carry-free addition; the delay of an arithmetic
operation is therefore independent of the operand bit width. The operations are
performed digit by digit serially from left to right which allows gradual
increase in the slice activities making it suitable for implementation on
reconfigurable devices. The serial nature of the online algorithm and the
gradual increment/decrement of active slices minimize the interconnect and
signal activity, resulting in an overall reduction of area and power
consumption. We present online multipliers in two configurations: with both
inputs serial, and with one input serial and one parallel. Pipelined and
non-pipelined designs of the proposed
multipliers have been synthesized with the GSCL 45 nm technology library using
Synopsys Design Compiler. A thorough comparative analysis has been performed
using widely used performance metrics. The results show that the proposed online multipliers
outperform the RL multipliers.
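The carry-free addition enabled by a redundant digit set can be made concrete with a short sketch. The Python below is illustrative only (not the authors' hardware design): it adds two radix-2 signed-digit numbers with digit set {-1, 0, 1}, where each position derives its transfer and interim sum from the local digit pair plus a one-digit peek at the lower neighbour, so no carry chain forms and the delay is independent of the operand width.

```python
def sd_value(digits):
    """Value of a radix-2 signed-digit number, most significant digit first."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v

def sd_add(x, y):
    """Carry-free addition of equal-length signed-digit operands (digits in {-1, 0, 1}).

    Each position i derives a transfer t[i] and interim sum w[i] from its own
    digit pair plus a one-digit peek at the lower neighbour; the transfer never
    propagates further than one position, so there is no carry chain.
    """
    n = len(x)
    p = [a + b for a, b in zip(x, y)]          # position sums, each in [-2, 2]
    t, w = [0] * n, [0] * n                    # t[i] transfers into position i-1
    for i in range(n):
        lower = p[i + 1] if i + 1 < n else 0   # one-digit peek, not a chain
        if p[i] == 2:
            t[i], w[i] = 1, 0
        elif p[i] == 1:
            t[i], w[i] = (1, -1) if lower >= 0 else (0, 1)
        elif p[i] == -1:
            t[i], w[i] = (0, -1) if lower >= 0 else (-1, 1)
        elif p[i] == -2:
            t[i], w[i] = -1, 0
    # final digit = interim sum + transfer arriving from the position below
    return [t[0]] + [w[i] + (t[i + 1] if i + 1 < n else 0) for i in range(n)]
```

Because every digit of the result depends on at most two neighbouring positions, all positions can be computed in parallel, which is what makes the addition delay width-independent.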
Related papers
- Multi-qubit Lattice Surgery Scheduling [3.7126786554865774]
A quantum circuit can be transpiled into a sequence of solely non-Clifford multi-qubit gates.
We show that the transpilation significantly reduces the circuit length on the set of circuits tested.
The resulting circuit of multi-qubit gates has a further reduction in the expected circuit execution time compared to serial execution.
arXiv Detail & Related papers (2024-05-27T22:41:41Z)
- Fast, Scalable, Warm-Start Semidefinite Programming with Spectral Bundling and Sketching [53.91395791840179]
We present Unified Spectral Bundling with Sketching (USBS), a provably correct, fast and scalable algorithm for solving massive SDPs.
USBS provides a 500x speed-up over the state-of-the-art scalable SDP solver on an instance with over 2 billion decision variables.
arXiv Detail & Related papers (2023-12-19T02:27:22Z)
- DSLOT-NN: Digit-Serial Left-to-Right Neural Network Accelerator [0.6435156676256051]
We propose a Digit-Serial Left-tO-righT arithmetic based processing technique called DSLOT-NN.
The proposed work has the ability to assess and terminate the ineffective convolutions which results in massive power and energy savings.
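The early-termination idea can be sketched under a hypothetical formulation: accumulate the weight/activation dot product one activation bit-plane at a time, most significant plane first, and stop once the remaining planes can no longer make the pre-ReLU sum non-negative. The function name and datapath below are illustrative assumptions, not DSLOT-NN's actual design.

```python
import numpy as np

def relu_dot_early_exit(w, x, bits=8):
    """Hypothetical MSD-first dot product with ReLU-aware early termination.

    x holds unsigned activations in [0, 2**bits); w holds signed weights.
    Returns (relu(w . x), number of bit-planes actually consumed).
    """
    w = np.asarray(w, dtype=np.int64)
    x = np.asarray(x, dtype=np.int64)
    pos_sum = w[w > 0].sum()            # best case the remaining bits can add
    partial = 0
    for k in range(bits - 1, -1, -1):   # most significant bit-plane first
        plane = (x >> k) & 1
        partial += (w @ plane) << k
        remaining_max = pos_sum * ((1 << k) - 1)
        if partial + remaining_max < 0:
            return 0, bits - k          # ReLU would zero it: stop computing
    return max(partial, 0), bits
```

When the output is dominated by negative weights, the sign is often decided after only a few planes, which is the source of the claimed power and energy savings.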
arXiv Detail & Related papers (2023-09-12T07:36:23Z)
- ADC/DAC-Free Analog Acceleration of Deep Neural Networks with Frequency Transformation [2.7488316163114823]
This paper proposes a novel approach to an energy-efficient acceleration of frequency-domain neural networks by utilizing analog-domain frequency-based tensor transformations.
Our approach achieves more compact cells by eliminating the need for trainable parameters in the transformation matrix.
On a 16$\times$16 crossbar, for 8-bit input processing, the proposed approach achieves an energy efficiency of 1602 tera-operations per second per watt.
arXiv Detail & Related papers (2023-09-04T19:19:39Z)
- ReLU and Addition-based Gated RNN [1.484528358552186]
We replace the multiplication and sigmoid function of the conventional recurrent gate with addition and ReLU activation.
This mechanism is designed to maintain long-term memory for sequence processing but at a reduced computational cost.
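The summary does not give the exact gate equations, so the contrast below is a minimal illustrative sketch of the idea only: a conventional multiplicative sigmoid gate next to a hypothetical addition/ReLU replacement.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def conventional_gate(h, g):
    # standard gating: elementwise multiply by a sigmoid gate value in (0, 1)
    return h * (1.0 / (1.0 + np.exp(-g)))

def additive_gate(h, g):
    # hypothetical sketch of the paper's idea: gate by addition and ReLU
    # instead of multiplication and sigmoid (the exact formulation may differ)
    return relu(h + relu(g))
```

The additive variant avoids both the elementwise multiply and the exponential, which is where the claimed reduction in computational cost would come from.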
arXiv Detail & Related papers (2023-08-10T15:18:16Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Multiplier with Reduced Activities and Minimized Interconnect for Inner Product Arrays [0.8078491757252693]
We present a pipelined multiplier with reduced activities and minimized interconnect based on online digit-serial arithmetic.
For $8$, $16$, $24$ and $32$ bit precision, the proposed low-power pipelined designs show up to $38\%$ and $44\%$ reductions in power and area, respectively.
arXiv Detail & Related papers (2022-04-11T05:45:43Z)
- Scaling the Convex Barrier with Sparse Dual Algorithms [141.4085318878354]
We present two novel dual algorithms for tight and efficient neural network bounding.
Both methods recover the strengths of the new relaxation: tightness and a linear separation oracle.
We can obtain better bounds than off-the-shelf solvers in only a fraction of their running time.
arXiv Detail & Related papers (2021-01-14T19:45:17Z)
- WrapNet: Neural Net Inference with Ultra-Low-Resolution Arithmetic [57.07483440807549]
We propose a method that adapts neural networks to use low-resolution (8-bit) additions in the accumulators, achieving classification accuracy comparable to their 32-bit counterparts.
We demonstrate the efficacy of our approach on both software and hardware platforms.
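The overflow behaviour that a low-resolution accumulator exhibits, and that the network is trained to tolerate, can be mimicked in software. The modular-arithmetic sketch below is illustrative, not WrapNet's implementation.

```python
def wrap_accumulate(products, acc_bits=8):
    """Accumulate into a low-resolution two's-complement register that wraps
    on overflow, instead of the usual wide 32-bit accumulator."""
    half = 1 << (acc_bits - 1)                    # 128 for an 8-bit register
    acc = 0
    for p in products:
        # wrap the running sum back into [-half, half), two's-complement style
        acc = (acc + p + half) % (1 << acc_bits) - half
    return acc
```

With an 8-bit register, a running sum of 200 wraps to -56; the training procedure in the paper is what keeps classification accuracy close to the 32-bit baseline despite such wraps.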
arXiv Detail & Related papers (2020-07-26T23:18:38Z)
- Straggler-aware Distributed Learning: Communication Computation Latency Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as strictly straggler or non-straggler.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
- Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
arXiv Detail & Related papers (2020-02-10T10:11:31Z)
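The Jacobi variant of this fixed-point view is easy to sketch for a chain of equal-width layers (an illustrative simplification; the paper handles more general feedforward computations): every sweep updates all layers in parallel from the previous iterate, and after at most L sweeps the result equals the sequential forward pass exactly.

```python
import numpy as np

def jacobi_feedforward(layers, x, sweeps=None):
    """Evaluate the chain h_l = f_l(h_{l-1}) by Jacobi fixed-point iteration.

    Assumes every layer maps R^n -> R^n so iterates can be pre-allocated.
    Each sweep applies all layers at once to the previous iterate; correct
    values propagate one layer per sweep, so len(layers) sweeps suffice.
    """
    L = len(layers)
    h = [np.zeros_like(x) for _ in range(L)]       # arbitrary initial guess
    for _ in range(sweeps if sweeps is not None else L):
        inputs = [x] + h[:-1]                      # previous iterate feeds each layer
        h = [f(v) for f, v in zip(layers, inputs)] # all layers in parallel (Jacobi)
    return h[-1]
```

The per-sweep layer applications are mutually independent, which is what allows trading extra parallel work for fewer sequential steps; the paper's claim is that the iteration often converges in far fewer than L sweeps.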
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.