Low-Latency Online Multiplier with Reduced Activities and Minimized
Interconnect for Inner Product Arrays
- URL: http://arxiv.org/abs/2304.12946v1
- Date: Thu, 6 Apr 2023 01:22:27 GMT
- Title: Low-Latency Online Multiplier with Reduced Activities and Minimized
Interconnect for Inner Product Arrays
- Authors: Muhammad Usman, Milos Ercegovac, Jeong-A Lee
- Abstract summary: This paper proposes a low latency multiplier based on online or left-to-right arithmetic.
Online arithmetic enables overlapping successive operations regardless of data dependency.
The serial nature of the online algorithm and the gradual increment/decrement of active slices minimize interconnect and signal activity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiplication is indispensable and is one of the core operations in many
modern applications including signal processing and neural networks.
Conventional right-to-left (RL) multipliers contribute substantially to the
power consumption, area utilization, and critical-path delay in such applications.
This paper proposes a low latency multiplier based on online or left-to-right
(LR) arithmetic which can increase throughput and reduce latency by digit-level
pipelining. Online arithmetic enables overlapping successive operations
regardless of data dependency because of the most significant digit first mode
of operation. Producing the most significant digit first requires a redundant
number system, which permits carry-free addition; the delay of an arithmetic
operation is therefore independent of the operand bit width. The operations are
performed digit by digit serially from left to right which allows gradual
increase in the slice activities making it suitable for implementation on
reconfigurable devices. The serial nature of the online algorithm and the
gradual increment/decrement of active slices minimize the interconnect and
signal activity, resulting in an overall reduction of area and power
consumption. We present online multipliers in two configurations: with both
inputs serial, and with one input serial and one parallel. Pipelined and
non-pipelined designs of the proposed
multipliers have been synthesized with the GSCL 45 nm technology library using
Synopsys Design Compiler. A thorough comparative analysis has been performed
using widely used performance metrics. The results show that the proposed online multipliers
outperform the RL multipliers.
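The carry-free addition enabled by a redundant digit set can be made concrete with a short sketch. The Python below is illustrative only (not the authors' hardware design): it adds two radix-2 signed-digit numbers with digit set {-1, 0, 1}, where each position derives its transfer and interim sum from the local digit pair plus a one-digit peek at the lower neighbour, so no carry chain forms and the delay is independent of the operand width.

```python
def sd_value(digits):
    """Value of a radix-2 signed-digit number, most significant digit first."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v

def sd_add(x, y):
    """Carry-free addition of equal-length signed-digit operands (digits in {-1, 0, 1}).

    Each position i derives a transfer t[i] and interim sum w[i] from its own
    digit pair plus a one-digit peek at the lower neighbour; the transfer never
    propagates further than one position, so there is no carry chain.
    """
    n = len(x)
    p = [a + b for a, b in zip(x, y)]          # position sums, each in [-2, 2]
    t, w = [0] * n, [0] * n                    # t[i] transfers into position i-1
    for i in range(n):
        lower = p[i + 1] if i + 1 < n else 0   # one-digit peek, not a chain
        if p[i] == 2:
            t[i], w[i] = 1, 0
        elif p[i] == 1:
            t[i], w[i] = (1, -1) if lower >= 0 else (0, 1)
        elif p[i] == -1:
            t[i], w[i] = (0, -1) if lower >= 0 else (-1, 1)
        elif p[i] == -2:
            t[i], w[i] = -1, 0
    # final digit = interim sum + transfer arriving from the position below
    return [t[0]] + [w[i] + (t[i + 1] if i + 1 < n else 0) for i in range(n)]
```

Because every digit of the result depends on at most two neighbouring positions, all positions can be computed in parallel, which is what makes the addition delay width-independent.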
Related papers
- Multi-qubit Lattice Surgery Scheduling [3.7126786554865774]
A quantum circuit can be transpiled into a sequence of solely non-Clifford multi-qubit gates.
We show that the transpilation significantly reduces the circuit length on the set of circuits tested.
The resulting circuit of multi-qubit gates has a further reduction in the expected circuit execution time compared to serial execution.
arXiv Detail & Related papers (2024-05-27T22:41:41Z)
- Fast, Scalable, Warm-Start Semidefinite Programming with Spectral Bundling and Sketching [53.91395791840179]
We present Unified Spectral Bundling with Sketching (USBS), a provably correct, fast and scalable algorithm for solving massive SDPs.
USBS provides a 500x speed-up over the state-of-the-art scalable SDP solver on an instance with over 2 billion decision variables.
arXiv Detail & Related papers (2023-12-19T02:27:22Z)
- DSLOT-NN: Digit-Serial Left-to-Right Neural Network Accelerator [0.6435156676256051]
We propose a Digit-Serial Left-tO-righT arithmetic based processing technique called DSLOT-NN.
The proposed work has the ability to assess and terminate the ineffective convolutions which results in massive power and energy savings.
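The early-termination idea can be sketched under a hypothetical formulation: accumulate the weight/activation dot product one activation bit-plane at a time, most significant plane first, and stop once the remaining planes can no longer make the pre-ReLU sum non-negative. The function name and datapath below are illustrative assumptions, not DSLOT-NN's actual design.

```python
import numpy as np

def relu_dot_early_exit(w, x, bits=8):
    """Hypothetical MSD-first dot product with ReLU-aware early termination.

    x holds unsigned activations in [0, 2**bits); w holds signed weights.
    Returns (relu(w . x), number of bit-planes actually consumed).
    """
    w = np.asarray(w, dtype=np.int64)
    x = np.asarray(x, dtype=np.int64)
    pos_sum = w[w > 0].sum()            # best case the remaining bits can add
    partial = 0
    for k in range(bits - 1, -1, -1):   # most significant bit-plane first
        plane = (x >> k) & 1
        partial += (w @ plane) << k
        remaining_max = pos_sum * ((1 << k) - 1)
        if partial + remaining_max < 0:
            return 0, bits - k          # ReLU would zero it: stop computing
    return max(partial, 0), bits
```

When the output is dominated by negative weights, the sign is often decided after only a few planes, which is the source of the claimed power and energy savings.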
arXiv Detail & Related papers (2023-09-12T07:36:23Z)
- ADC/DAC-Free Analog Acceleration of Deep Neural Networks with Frequency Transformation [2.7488316163114823]
This paper proposes a novel approach to an energy-efficient acceleration of frequency-domain neural networks by utilizing analog-domain frequency-based tensor transformations.
Our approach achieves more compact cells by eliminating the need for trainable parameters in the transformation matrix.
On a 16$\times$16 crossbar, for 8-bit input processing, the proposed approach achieves an energy efficiency of 1602 tera-operations per second per watt.
arXiv Detail & Related papers (2023-09-04T19:19:39Z)
- ReLU and Addition-based Gated RNN [1.484528358552186]
We replace the multiplication and sigmoid function of the conventional recurrent gate with addition and ReLU activation.
This mechanism is designed to maintain long-term memory for sequence processing but at a reduced computational cost.
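The summary does not give the exact gate equations, so the contrast below is a minimal illustrative sketch of the idea only: a conventional multiplicative sigmoid gate next to a hypothetical addition/ReLU replacement.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def conventional_gate(h, g):
    # standard gating: elementwise multiply by a sigmoid gate value in (0, 1)
    return h * (1.0 / (1.0 + np.exp(-g)))

def additive_gate(h, g):
    # hypothetical sketch of the paper's idea: gate by addition and ReLU
    # instead of multiplication and sigmoid (the exact formulation may differ)
    return relu(h + relu(g))
```

The additive variant avoids both the elementwise multiply and the exponential, which is where the claimed reduction in computational cost would come from.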
arXiv Detail & Related papers (2023-08-10T15:18:16Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Multiplier with Reduced Activities and Minimized Interconnect for Inner Product Arrays [0.8078491757252693]
We present a pipelined multiplier with reduced activities and minimized interconnect based on online digit-serial arithmetic.
For $8$, $16$, $24$ and $32$ bit precision, the proposed low-power pipelined designs show up to $38\%$ and $44\%$ reductions in power and area, respectively.
arXiv Detail & Related papers (2022-04-11T05:45:43Z)
- Scaling the Convex Barrier with Sparse Dual Algorithms [141.4085318878354]
We present two novel dual algorithms for tight and efficient neural network bounding.
Both methods recover the strengths of the new relaxation: tightness and a linear separation oracle.
We can obtain better bounds than off-the-shelf solvers in only a fraction of their running time.
arXiv Detail & Related papers (2021-01-14T19:45:17Z)
- WrapNet: Neural Net Inference with Ultra-Low-Resolution Arithmetic [57.07483440807549]
We propose a method that adapts neural networks to use low-resolution (8-bit) additions in the accumulators, achieving classification accuracy comparable to their 32-bit counterparts.
We demonstrate the efficacy of our approach on both software and hardware platforms.
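The overflow behaviour that a low-resolution accumulator exhibits, and that the network is trained to tolerate, can be mimicked in software. The modular-arithmetic sketch below is illustrative, not WrapNet's implementation.

```python
def wrap_accumulate(products, acc_bits=8):
    """Accumulate into a low-resolution two's-complement register that wraps
    on overflow, instead of the usual wide 32-bit accumulator."""
    half = 1 << (acc_bits - 1)                    # 128 for an 8-bit register
    acc = 0
    for p in products:
        # wrap the running sum back into [-half, half), two's-complement style
        acc = (acc + p + half) % (1 << acc_bits) - half
    return acc
```

With an 8-bit register, a running sum of 200 wraps to -56; the training procedure in the paper is what keeps classification accuracy close to the 32-bit baseline despite such wraps.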
arXiv Detail & Related papers (2020-07-26T23:18:38Z)
- Straggler-aware Distributed Learning: Communication Computation Latency Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as strictly straggler or non-straggler.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
- Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
arXiv Detail & Related papers (2020-02-10T10:11:31Z)
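The Jacobi variant of this fixed-point view is easy to sketch for a chain of equal-width layers (an illustrative simplification; the paper handles more general feedforward computations): every sweep updates all layers in parallel from the previous iterate, and after at most L sweeps the result equals the sequential forward pass exactly.

```python
import numpy as np

def jacobi_feedforward(layers, x, sweeps=None):
    """Evaluate the chain h_l = f_l(h_{l-1}) by Jacobi fixed-point iteration.

    Assumes every layer maps R^n -> R^n so iterates can be pre-allocated.
    Each sweep applies all layers at once to the previous iterate; correct
    values propagate one layer per sweep, so len(layers) sweeps suffice.
    """
    L = len(layers)
    h = [np.zeros_like(x) for _ in range(L)]       # arbitrary initial guess
    for _ in range(sweeps if sweeps is not None else L):
        inputs = [x] + h[:-1]                      # previous iterate feeds each layer
        h = [f(v) for f, v in zip(layers, inputs)] # all layers in parallel (Jacobi)
    return h[-1]
```

The per-sweep layer applications are mutually independent, which is what allows trading extra parallel work for fewer sequential steps; the paper's claim is that the iteration often converges in far fewer than L sweeps.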
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.