An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks
- URL: http://arxiv.org/abs/2411.06360v3
- Date: Fri, 02 May 2025 16:58:10 GMT
- Title: An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks
- Authors: Mohsen Dehghankar, Mahdi Erfanian, Abolfazl Asudeh
- Abstract summary: We propose algorithms to improve the inference time and memory efficiency of Deep Neural Networks (DNNs). We focus on matrix multiplication as the bottleneck operation of inference. Our experiments show up to a 5.24x speedup in the inference time.
- Score: 8.779871128906787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their tremendous success and versatility, Deep Neural Networks (DNNs) such as Large Language Models (LLMs) suffer from inference inefficiency and rely on advanced computational infrastructure. To address these challenges and make these models more accessible and cost-effective, in this paper, we propose algorithms to improve the inference time and memory efficiency of DNNs with binary and ternary weight matrices. Particularly focusing on matrix multiplication as the bottleneck operation of inference, we observe that, once trained, the weight matrices of a model no longer change. This allows us to preprocess these matrices and create indices that help reduce the storage requirements by a logarithmic factor while enabling our efficient inference algorithms. Specifically, for an $n\times n$ weight matrix, our efficient algorithm guarantees a time complexity of $O(\frac{n^2}{\log n})$, a logarithmic factor improvement over the standard vector-matrix multiplication. Besides theoretical analysis, we conduct extensive experiments to evaluate the practical efficiency of our algorithms. Our results confirm the superiority of our approach both with respect to time and memory, as we observed a reduction in multiplication time of up to 29x and in memory usage of up to 6x. When applied to LLMs, our experiments show up to a 5.24x speedup in the inference time.
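To make the complexity claim concrete, below is a minimal NumPy sketch of the classic lookup-table idea (method-of-Four-Russians style) behind $O(n^2/\log n)$ vector-matrix multiplication over a fixed binary matrix. It illustrates only the time side of the argument and the general technique, not the paper's exact indexing scheme; the block size b ≈ log2(n) and helper names are assumptions, and a ternary matrix could be handled analogously, e.g., as the difference of two binary matrices.

```python
import numpy as np

def preprocess(W, b):
    # Pack each length-b segment of every column of the binary matrix W
    # (entries in {0, 1}) into one integer code; the multiplication routine
    # below answers each column-segment with a single table lookup on that code.
    n, m = W.shape
    assert n % b == 0
    segments = W.reshape(n // b, b, m)                 # rows grouped into segments
    bit_weights = (1 << np.arange(b)).reshape(1, b, 1)
    return (segments * bit_weights).sum(axis=1)        # shape (n/b, m), integer codes

def fast_vecmat(x, codes, b):
    # Computes y = x @ W from the packed codes.  For each segment, all 2^b subset
    # sums of x[segment] are built once (incremental DP), after which every column
    # costs one lookup: O((n/b) * (2^b + m)) total, i.e. O(n*m / log n) for b ~ log2(n).
    n_seg, m = codes.shape
    y = np.zeros(m)
    for s in range(n_seg):
        xs = x[s * b:(s + 1) * b]
        table = np.zeros(1 << b)
        for p in range(1, 1 << b):
            low = p & -p                               # lowest set bit of the pattern
            table[p] = table[p ^ low] + xs[low.bit_length() - 1]
        y += table[codes[s]]
    return y

# sanity check against the naive product
rng = np.random.default_rng(0)
n, b = 16, 4                                           # b ~ log2(n)
W = rng.integers(0, 2, size=(n, n))
x = rng.standard_normal(n)
assert np.allclose(fast_vecmat(x, preprocess(W, b), b), x @ W)
```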
Related papers
- Improving Algorithmic Efficiency using Cryptography [11.496343300483904]
We show how to use cryptography to improve the time complexity of solving computational problems.
We show that under standard cryptographic assumptions, we can design algorithms that are faster than existing ones.
arXiv Detail & Related papers (2025-02-18T17:08:59Z) - Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization [22.317176475276725]
We investigate two remarkable phenomena observed during the fine-tuning of Large Language Models (LLMs)
Fine-tuning only the $\mathbf{W}_q$ and $\mathbf{W}_v$ matrices significantly improves performance over optimizing the $\mathbf{W}_k$ matrix.
We propose a new strategy that improves fine-tuning efficiency in terms of both storage and time.
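As a rough illustration of the selective fine-tuning setup described above (not the authors' implementation), the PyTorch snippet below freezes W_k and fine-tunes only W_q and W_v in a toy attention projection block; all class names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttnProjections(nn.Module):
    # Toy projection block standing in for one attention layer.
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

attn = AttnProjections(d_model=768)

# Freeze W_k at its pretrained values; fine-tune only W_q and W_v.
for p in attn.W_k.parameters():
    p.requires_grad = False

trainable = [p for p in attn.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```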
arXiv Detail & Related papers (2024-10-03T06:37:37Z) - Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference.
We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization.
We implement an arbitrary precision matrix multiplication scheme that decomposes and recovers at the bit level, enabling flexible precision.
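As a hedged illustration of bit-level decomposition in general (not the paper's bipolar-INT format or its Tensor Core kernels), the NumPy sketch below multiplies two unsigned low-precision integer matrices by splitting each operand into binary bit planes and recombining the plane products with power-of-two shifts.

```python
import numpy as np

def bitplane_matmul(A, B, bits_a, bits_b):
    # Decompose each unsigned-integer operand into binary bit planes, multiply
    # the planes as 1-bit matrices, and recover the full-precision product
    # by shifting each plane product by the sum of its bit positions.
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(bits_a):
        A_i = (A >> i) & 1            # i-th bit plane of A
        for j in range(bits_b):
            B_j = (B >> j) & 1        # j-th bit plane of B
            acc += (A_i @ B_j) << (i + j)
    return acc

rng = np.random.default_rng(0)
A = rng.integers(0, 16, size=(8, 8))   # 4-bit operands
B = rng.integers(0, 16, size=(8, 8))
assert np.array_equal(bitplane_matmul(A, B, 4, 4), A @ B)
```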
arXiv Detail & Related papers (2024-09-26T14:17:58Z) - Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense layers for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z) - An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z) - Fast Matrix Multiplication Without Tears: A Constraint Programming Approach [8.52818380743467]
It is known that the multiplication of an $N \times M$ matrix with an $M \times P$ matrix can be performed using fewer multiplications than what the naive $NMP$ approach suggests.
This gives rise to the constraint satisfaction problem of fast matrix multiplication.
We propose a simple yet novel Constraint Programming approach to find non-commutative algorithms for fast matrix multiplication.
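For context, Strassen's classical scheme is the textbook instance of such a saving: a 2x2 block product with 7 block multiplications instead of the naive 8. The sketch below verifies it numerically; it is background illustration, not the constraint-programming search the paper describes.

```python
import numpy as np

def strassen_2x2(A, B):
    # Strassen's scheme: 7 block multiplications instead of the naive 8
    # for a product of matrices split into 2x2 grids of equal blocks.
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(strassen_2x2(A, B), A @ B)
```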
arXiv Detail & Related papers (2023-06-01T19:15:24Z) - RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations [56.59168541623729]
Training graph neural networks (GNNs) is time-consuming because sparse graph-based operations are difficult to accelerate in hardware.
We explore trading off the computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computations, which for the first time demonstrate the potential of training GNNs with approximated operations.
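As a hedged illustration of sampling-based approximate matrix products in general (not RSC's actual estimator for GNN training), the sketch below forms an unbiased estimate of A @ B from a subset of column-row outer products, sampled proportionally to their norms.

```python
import numpy as np

def sampled_matmul(A, B, k, rng):
    # Unbiased estimate of A @ B from k sampled outer products: columns of A
    # paired with rows of B, sampled by norm and reweighted so the expectation
    # equals the exact product.
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=k, p=probs)
    scale = 1.0 / (k * probs[idx])
    return (A[:, idx] * scale) @ B[idx, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
B = rng.standard_normal((256, 64))
approx = sampled_matmul(A, B, k=128, rng=rng)
print(np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B))  # relative error
```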
arXiv Detail & Related papers (2022-10-19T17:25:33Z) - Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z) - A Structured Sparse Neural Network and Its Matrix Calculations Algorithm [0.0]
We introduce a nonsymmetric, tridiagonal matrix with off-diagonal sparse entries and offset sub- and super-diagonals.
For cases where the matrix inverse does not exist, a least-squares-type pseudoinverse is provided.
Results show significant improvements in computational cost, especially as the matrix size increases.
arXiv Detail & Related papers (2022-07-02T19:38:48Z) - Fast matrix multiplication for binary and ternary CNNs on ARM CPU [0.9135092203041721]
We propose fast algorithms of ternary, ternary-binary, and binary matrix multiplication for mobile devices with ARM architecture.
Our algorithms can be used to implement inference of convolutional and fully connected layers of TNNs, TBNs, and BNNs.
We evaluate them experimentally on ARM Cortex-A73 CPU and compare their inference speed to efficient implementations of full-precision, 8-bit, and 4-bit quantized matrix multiplications.
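The standard building block behind such binary kernels is the XNOR/popcount dot product; the small Python sketch below illustrates it for {-1, +1} vectors packed into machine words. It is a plain-Python illustration of the trick, not the authors' ARM NEON implementation, and a ternary operand can be handled, for example, as two binary planes.

```python
def pack(signs):
    # Pack a {-1, +1} vector into one integer: bit i is set iff entry i is +1.
    code = 0
    for i, s in enumerate(signs):
        if s > 0:
            code |= 1 << i
    return code

def binary_dot(code_a, code_b, n):
    # Dot product of two {-1, +1} vectors of length n from their packed codes:
    # matching bits contribute +1, differing bits -1, so
    # dot = n - 2 * popcount(a XOR b).
    return n - 2 * bin(code_a ^ code_b).count("1")

a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]
assert binary_dot(pack(a), pack(b), len(a)) == sum(x * y for x, y in zip(a, b))
```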
arXiv Detail & Related papers (2022-05-18T14:52:34Z) - Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z) - High-Dimensional Sparse Bayesian Learning without Covariance Matrices [66.60078365202867]
We introduce a new inference scheme that avoids explicit construction of the covariance matrix.
Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm.
On several simulations, our method scales better than existing approaches in computation time and memory.
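As a hedged sketch of that combination, assuming the quantity of interest is the diagonal of a matrix inverse (e.g., a posterior covariance), the snippet below pairs a stochastic diagonal estimator with conjugate-gradient solves so the inverse is never formed; the paper's exact estimator and model may differ.

```python
import numpy as np
from scipy.sparse.linalg import cg

def estimate_inverse_diagonal(A, num_probes, rng):
    # Estimate diag(A^{-1}) without forming A^{-1}: solve A x_k = v_k by CG for
    # random +/-1 probe vectors v_k and combine them with the Bekas-style
    # estimator diag(A^{-1}) ~ (sum_k v_k * x_k) / (sum_k v_k * v_k).
    n = A.shape[0]
    num, den = np.zeros(n), np.zeros(n)
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=n)
        x, info = cg(A, v)                 # never materializes A^{-1}
        assert info == 0
        num += v * x
        den += v * v
    return num / den

# quick sanity check on a small, well-conditioned SPD matrix
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
est = estimate_inverse_diagonal(A, num_probes=200, rng=rng)
print(np.max(np.abs(est - np.diag(np.linalg.inv(A)))))
```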
arXiv Detail & Related papers (2022-02-25T16:35:26Z) - A Deep Learning Inference Scheme Based on Pipelined Matrix Multiplication Acceleration Design and Non-uniform Quantization [9.454905560571085]
We introduce a low-power Multi-layer Perceptron (MLP) accelerator based on a pipelined matrix multiplication scheme and a nonuniform quantization methodology.
Results show that our method achieves better performance with lower power consumption.
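As a rough illustration of nonuniform (codebook) quantization in general (not the authors' specific methodology or hardware design), the sketch below fits a small codebook to a weight vector with Lloyd-style updates and stores only per-weight indices plus the codebook.

```python
import numpy as np

def nonuniform_quantize(w, levels):
    # Nonuniform quantization via a learned codebook: dense regions of the
    # weight distribution receive more levels than a uniform grid would give.
    code = np.quantile(w, np.linspace(0.0, 1.0, levels))  # quantile initialization
    for _ in range(20):                                   # Lloyd iterations
        idx = np.argmin(np.abs(w[:, None] - code[None, :]), axis=1)
        for k in range(levels):
            if np.any(idx == k):
                code[k] = w[idx == k].mean()
    return idx.astype(np.uint8), code

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)                  # toy weight vector
idx, code = nonuniform_quantize(w, levels=16)  # 4-bit nonuniform codes
w_hat = code[idx]
print(np.mean((w - w_hat) ** 2))               # reconstruction MSE
```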
arXiv Detail & Related papers (2021-10-10T17:31:27Z) - Multiplying Matrices Without Multiplying [0.0]
Multiplying matrices is among the most fundamental and compute-intensive operations in machine learning.
We introduce a learning-based algorithm for this task that greatly outperforms existing methods.
arXiv Detail & Related papers (2021-06-21T05:08:54Z) - Provably Faster Algorithms for Bilevel Optimization [54.83583213812667]
Bilevel optimization has been widely applied in many important machine learning applications.
We propose two new algorithms for bilevel optimization.
We show that both algorithms achieve a complexity of $\mathcal{O}(\epsilon^{-1.5})$, which outperforms all existing algorithms by an order of magnitude.
arXiv Detail & Related papers (2021-06-08T21:05:30Z)