Multiplier with Reduced Activities and Minimized Interconnect for Inner
Product Arrays
- URL: http://arxiv.org/abs/2204.09515v1
- Date: Mon, 11 Apr 2022 05:45:43 GMT
- Title: Multiplier with Reduced Activities and Minimized Interconnect for Inner
Product Arrays
- Authors: Muhammad Usman, Jeong-A Lee and Milos D. Ercegovac
- Abstract summary: We present a pipelined multiplier with reduced activities and minimized interconnect based on online digit-serial arithmetic.
For $8$, $16$, $24$ and $32$ bit precisions, the proposed low-power pipelined design shows up to $38\%$ and $44\%$ reductions in power and area, respectively.
- Score: 0.8078491757252693
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We present a pipelined multiplier with reduced activities and minimized
interconnect based on online digit-serial arithmetic. The working precision is
truncated such that $p<n$ bits are used to compute an $n$-bit product,
resulting in significant savings in area and power. The digit slices follow a
variable precision according to the input, increasing up to $p$ and then
decreasing according to the error profile. The design is pipelined to achieve
the high throughput and low latency desirable for compute-intensive inner
products. Synthesis results of the proposed designs are presented and compared
with the non-pipelined online multiplier, the pipelined online multiplier with
full working precision, and conventional serial-parallel and array multipliers.
For $8$, $16$, $24$ and $32$ bit precisions, the proposed low-power pipelined
design shows up to $38\%$ and $44\%$ reductions in power and area,
respectively, compared to the pipelined online multiplier without
working-precision truncation.
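
To make the working-precision idea concrete, below is a minimal behavioral sketch in Python. It is not the authors' hardware design or their signed-digit online algorithm: it simply consumes both operands most-significant-bit first and truncates the running accumulator to $p < n$ fractional bits, illustrating how a reduced working precision trades a bounded error (at most $n \cdot 2^{-p}$ in this sketch) for a narrower datapath. The function name, the bit-list encoding, and the truncation rule are illustrative assumptions.

```python
import math
import random


def msd_first_truncated_multiply(x_bits, y_bits, p):
    """Multiply two unsigned fractions, given as MSB-first bit lists,
    keeping only p fractional bits in the working accumulator.
    Behavioral illustration only; not the paper's online multiplier."""
    n = len(x_bits)
    acc = 0.0        # running estimate of the product of the consumed prefixes
    x_prefix = 0.0   # value of the x digits consumed so far
    y_prefix = 0.0   # value of the y digits consumed so far
    for j in range(n):
        w = 2.0 ** -(j + 1)  # weight of the j-th fractional bit
        # Cross terms contributed by the digits arriving at step j:
        # (X + x_j*w)(Y + y_j*w) = X*Y + x_j*w*Y + y_j*w*X + x_j*y_j*w^2
        acc += x_bits[j] * w * y_prefix
        acc += y_bits[j] * w * x_prefix
        acc += x_bits[j] * y_bits[j] * w * w
        x_prefix += x_bits[j] * w
        y_prefix += y_bits[j] * w
        # Truncate the working accumulator to p fractional bits,
        # losing less than 2**-p per step.
        acc = math.floor(acc * 2 ** p) / 2 ** p
    return acc


if __name__ == "__main__":
    random.seed(0)
    n, p = 16, 10
    x_bits = [random.randint(0, 1) for _ in range(n)]
    y_bits = [random.randint(0, 1) for _ in range(n)]
    x = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(x_bits))
    y = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(y_bits))
    approx = msd_first_truncated_multiply(x_bits, y_bits, p)
    print(f"exact={x * y:.10f}  truncated={approx:.10f}  "
          f"bound={n * 2.0 ** -p:.6f}")
```

Because the accumulator is truncated after every digit, the sketch under-approximates the exact product by less than $n \cdot 2^{-p}$; choosing $p < n$ therefore bounds the error while shrinking the accumulator width, which is the kind of trade-off the paper exploits in hardware.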
Related papers
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - TCNCA: Temporal Convolution Network with Chunked Attention for Scalable
Sequence Processing [52.64837396100988]
MEGA is a recent transformer-based architecture, which utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as $O(L \log L)$, with $L$ being the sequence length.
We build upon their approach by replacing the linear recurrence with a special temporal convolutional network which permits larger receptive field size with shallower networks, and reduces the computational complexity to $O(L)$.
We evaluate TCNCA on EnWik8 language modeling, long-range-arena (LRA) sequence classification, as well as a synthetic reasoning benchmark associative recall.
arXiv Detail & Related papers (2023-12-09T16:12:25Z) - Low-Overhead Parallelisation of LCU via Commuting Operators [0.0]
Linear Combination of Unitaries (LCU) is a powerful scheme for the block encoding of operators but suffers from high overheads.
We discuss the parallelisation of LCU and in particular the SELECT subroutine of LCU based on partitioning of observables into groups of commuting operators.
We additionally discuss the parallelisation of QROM circuits which are a special case of our main results.
arXiv Detail & Related papers (2023-12-01T16:29:02Z) - Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH).
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z) - DeepPCR: Parallelizing Sequential Operations in Neural Networks [4.241834259165193]
We introduce DeepPCR, a novel algorithm which parallelizes typically sequential operations in order to speed up inference and training of neural networks.
DeepPCR is based on interpreting a sequence of $L$ steps as the solution of a specific system of equations, which we recover using the Parallel Cyclic Reduction algorithm.
To verify the theoretical lower complexity of the algorithm, and to identify regimes for speedup, we test the effectiveness of DeepPCR in parallelizing the forward and backward pass in multi-layer perceptrons.
arXiv Detail & Related papers (2023-09-28T10:15:30Z) - Tangent Transformers for Composition, Privacy and Removal [58.280295030852194]
Tangent Attention Fine-Tuning (TAFT) is a method for fine-tuning linearized transformers.
arXiv Detail & Related papers (2023-07-16T18:31:25Z) - Low-Latency Online Multiplier with Reduced Activities and Minimized
Interconnect for Inner Product Arrays [0.8078491757252693]
This paper proposes a low-latency multiplier based on online, or left-to-right, arithmetic.
Online arithmetic enables overlapping successive operations regardless of data dependency.
The serial nature of the online algorithm and the gradual increment/decrement of active slices minimize interconnect and signal activity.
arXiv Detail & Related papers (2023-04-06T01:22:27Z) - Fast quantum circuit cutting with randomized measurements [0.0]
We propose a new method to extend the size of a quantum computation beyond the number of physical qubits available on a single device.
This is accomplished by randomly inserting measure-and-prepare channels to express the output state of a large circuit as a separable state across distinct devices.
arXiv Detail & Related papers (2022-07-29T15:13:04Z) - Tabula: Efficiently Computing Nonlinear Activation Functions for Secure Neural Network Inference [18.363580113885174]
Multiparty approaches to secure neural network inference commonly rely on garbled circuits.
We propose Tabula, an algorithm based on secure lookup tables.
Compared to garbled circuits with 8-bit quantized inputs, Tabula with 8-bit activations uses between $280\times$ and $560\times$ less communication.
arXiv Detail & Related papers (2022-03-05T23:26:06Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z) - Bayesian Bits: Unifying Quantization and Pruning [73.27732135853243]
We introduce Bayesian Bits, a practical method for joint mixed precision quantization and pruning through gradient based optimization.
We experimentally validate our proposed method on several benchmark datasets and show that we can learn pruned, mixed precision networks.
arXiv Detail & Related papers (2020-05-14T16:00:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.