Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs
- URL: http://arxiv.org/abs/2503.12211v1
- Date: Sat, 15 Mar 2025 17:31:36 GMT
- Title: Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs
- Authors: Nir Ailon, Akhiad Bercovich, Omri Weinstein
- Abstract summary: We propose a cheaper alternative bilinear operator to matrix multiplication in deep neural networks (DNNs). We show that replacing *all* linear layers with STL and training from scratch results in a x2.7 reduction in FLOPs with a 0.5% *accuracy improvement*. Finetuning TinyLlama [tinyllama24] with STL layers on the SlimPajama dataset achieves accuracy similar to 2:4 structured sparsity, with a x2.2 FLOP speedup compared to x1.7 for the latter.
- Score: 1.8911962184174564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a cheaper alternative bilinear operator to matrix multiplication in deep neural networks (DNNs). Unlike many stubborn attempts to accelerate MatMuls in DNN inference, this operator is supported by the capabilities of existing GPU hardware, most notably NVIDIA TensorCores. To our knowledge, this is the first GPU-native acceleration technique which *does not decrease* (in fact, increases) the number of trainable parameters of the network, mitigating the accuracy loss of compression-based techniques. Hence, this operator is at the same time more expressive than MatMul, yet requires substantially *fewer* FLOPs to evaluate. We term this new operator *Strassen-Tile* (STL). The main idea behind STL(X, W) is a *local* change of basis (learnable encoder) on weight and activation *tiles*, after which we perform batched *elementwise* products between tiles, and a final decoding transformation (inspired by algebraic pipelines from fast matrix and polynomial multiplication). We compare STL against two benchmarks. The first is the SoTA T2T-ViT on ImageNet-1K. Here we show that replacing *all* linear layers with STL and training from scratch results in a x2.7 reduction in FLOPs with a 0.5% *accuracy improvement*. Our second speed-accuracy comparison benchmark, for pretrained LLMs, is the most practical GPU-acceleration technique: 2:4 structured sparsity. Finetuning TinyLlama [tinyllama24] with STL layers on the SlimPajama dataset achieves accuracy similar to 2:4, with a x2.2 FLOP speedup compared to x1.7 for the latter. Finally, we discuss a group-theoretic approach for discovering *universal* encoders for STL, which could lead to fast *black-box* acceleration via approximate matrix multiplication (AMM).
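As a concrete reading of the STL pipeline described above, here is a minimal NumPy sketch: flatten each t x t tile of X and W, apply learnable encoders, take batched elementwise products accumulated over the inner tile index (mirroring block matrix multiplication), and decode back to output tiles. The names E_x, E_w, D and the encoder shapes (rank r) are our assumptions, not the paper's notation.

```python
import numpy as np

def strassen_tile(X, W, E_x, E_w, D, t):
    """Sketch of a Strassen-Tile-like bilinear operator STL(X, W).

    X, W: (n, n) with n divisible by t; E_x, E_w: (r, t*t) learnable
    encoders; D: (t*t, r) learnable decoder. Every t x t tile product of
    an ordinary block MatMul is replaced by decode(enc(x_tile) * enc(w_tile)).
    """
    n = X.shape[0]
    m = n // t  # number of tiles per side
    # Flatten each t x t tile: (n, n) -> (m, m, t*t).
    Xt = X.reshape(m, t, m, t).transpose(0, 2, 1, 3).reshape(m, m, t * t)
    Wt = W.reshape(m, t, m, t).transpose(0, 2, 1, 3).reshape(m, m, t * t)
    Xe = Xt @ E_x.T  # local change of basis on activation tiles: (m, m, r)
    We = Wt @ E_w.T  # local change of basis on weight tiles:     (m, m, r)
    # Batched elementwise tile products, summed over the inner tile index k,
    # mirroring C[i,j] = sum_k A[i,k] @ B[k,j] of block matrix multiplication.
    Pe = np.einsum('ikr,kjr->ijr', Xe, We)
    Ct = Pe @ D.T  # decode back to flattened output tiles: (m, m, t*t)
    return Ct.reshape(m, m, t, t).transpose(0, 2, 1, 3).reshape(n, n)

# Degenerate check: t = 1 with identity encoders/decoder is plain MatMul.
n = 8
X, W = np.random.randn(n, n), np.random.randn(n, n)
I = np.ones((1, 1))
assert np.allclose(strassen_tile(X, W, I, I, I, 1), X @ W)
```

For intuition on the speed-accuracy knob: with t = 2, the classical Strassen encoders (r = 7 < t^3 = 8) make this exact, while learned encoders with smaller r trade exactness for FLOPs, which appears to be the trade-off the abstract reports.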
Related papers
- Speedy MASt3R [68.47052557089631]
MASt3R redefines image matching as a 3D task by leveraging DUSt3R and introducing a fast reciprocal matching scheme. Fast MASt3R achieves a 54% reduction in inference time (198 ms to 91 ms per image pair) without sacrificing accuracy. This advancement enables real-time 3D understanding, benefiting applications like mixed reality navigation and large-scale 3D scene reconstruction.
arXiv Detail & Related papers (2025-03-13T03:56:22Z)
- Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs [20.506357657234755]
Opara is a resource- and interference-aware scheduling framework to accelerate Deep Neural Network (DNN) inference on GPUs.
We implement and open-source a prototype of Opara based on PyTorch in a non-intrusive manner.
Prototype experiments with representative DNN and Transformer-based models demonstrate that Opara outperforms the default sequential CUDA Graph execution in PyTorch.
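The summary doesn't detail Opara's scheduler, but the inter-operator parallelism it targets can be illustrated with plain PyTorch CUDA streams, where two independent operators overlap instead of running sequentially. A generic sketch (assumes a CUDA GPU; this is not Opara's actual implementation):

```python
import torch

# Two independent operators of a model can overlap on the GPU when
# issued on separate CUDA streams instead of the default serial order.
x = torch.randn(64, 1024, device="cuda")
w1 = torch.randn(1024, 1024, device="cuda")
w2 = torch.randn(1024, 1024, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
with torch.cuda.stream(s1):
    y1 = x @ w1  # operator A on stream 1
with torch.cuda.stream(s2):
    y2 = x @ w2  # operator B, independent of A, overlaps on stream 2

torch.cuda.synchronize()  # join both streams before consuming the results
out = y1 + y2
```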
arXiv Detail & Related papers (2023-12-16T06:48:11Z)
- DYAD: A Descriptive Yet Abjuring Density efficient approximation to linear neural network layers [19.949611634077634]
We devise, implement, and performance-assess DYAD, a layer which can serve as a faster and more memory-efficient approximate replacement for linear layers.
DYAD is based on a bespoke near-sparse matrix structure which approximates the dense "weight" matrix W that matrix-multiplies the input in the typical realization of such a layer, a.k.a. DENSE.
arXiv Detail & Related papers (2023-12-11T23:04:48Z)
- ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer [6.473688838974095]
We propose a new type of multiplication-reduced model, dubbed ShiftAddViT, to achieve end-to-end inference speedups on GPUs.
Experiments on various 2D/3D vision tasks consistently validate the effectiveness of our proposed ShiftAddViT.
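The summary doesn't name the primitives, but the shift-add idea underlying this line of work replaces multiplication by a weight quantized to a power of two with a cheap (bit-)shift. A hedged NumPy illustration of that principle, not ShiftAddViT's actual kernels:

```python
import numpy as np

def shift_mult(x: np.ndarray, w: float) -> np.ndarray:
    """Approximate x * w using only a sign flip and a power-of-two scaling.

    Quantizing w to sign(w) * 2**k turns the multiply into an exponent
    shift (np.ldexp), the floating-point analogue of an integer bit shift.
    """
    k = int(np.round(np.log2(abs(w))))  # nearest power-of-two exponent
    return np.sign(w) * np.ldexp(x, k)  # x * 2**k via exponent shift

x = np.array([1.5, -2.0, 3.0])
print(shift_mult(x, 3.7))  # ~ x * 4, since 3.7 is quantized to 2**2
print(x * 3.7)             # exact multiply, for comparison
```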
arXiv Detail & Related papers (2023-06-10T13:53:41Z)
- SKI to go Faster: Accelerating Toeplitz Neural Networks via Asymmetric Kernels [69.47358238222586]
Toeplitz Neural Networks (TNNs) are a recent sequence model with impressive results.
We aim to reduce the O(n) computational complexity and the O(n) relative positional encoder (RPE) multi-layer perceptron (MLP) and decay-bias calls.
For bidirectional models, this motivates a sparse plus low-rank Toeplitz matrix decomposition.
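For background, the core TNN primitive is a Toeplitz matrix-vector product, which drops from O(n^2) to O(n log n) via circulant embedding and the FFT. A sketch of that standard trick (background only; the paper's contribution is the SKI-style asymmetric-kernel approach built on top of it):

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_matvec(c: np.ndarray, r: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply the Toeplitz matrix T (first column c, first row r) by x
    in O(n log n), by embedding T in a 2n-sized circulant and using the FFT."""
    n = len(x)
    # First column of the circulant embedding: [c, 0, reversed tail of r].
    col = np.concatenate([c, [0.0], r[:0:-1]])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(x, 2 * n))
    return y[:n].real

# Sanity check against the dense O(n^2) product.
c, r = np.array([1.0, 2.0, 3.0]), np.array([1.0, -1.0, 0.5])
x = np.array([0.5, 1.0, -2.0])
print(toeplitz_matvec(c, r, x))
print(toeplitz(c, r) @ x)
```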
arXiv Detail & Related papers (2023-05-15T21:25:35Z)
- Improved techniques for deterministic l2 robustness [63.34032156196848]
Training convolutional neural networks (CNNs) with a strict 1-Lipschitz constraint under the $l_2$ norm is useful for adversarial robustness, interpretable gradients and stable training.
We introduce a procedure to certify the robustness of 1-Lipschitz CNNs by replacing the last linear layer with a 1-hidden-layer MLP.
We significantly advance the state-of-the-art for standard and provable robust accuracies on CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2022-11-15T19:10:12Z)
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of hardware-efficient matrices (Monarch), parameterized as products of two block-diagonal matrices for better hardware utilization; a sketch of this structure follows.
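A minimal sketch of that block-diagonal structure, assuming n = m^2 and transpose permutations between the two factors (a simplification; the paper's exact permutations and block shapes may differ):

```python
import numpy as np

def monarch_matvec(L: np.ndarray, R: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = P L P R x for block-diagonal L, R given as (m, m, m) block stacks,
    where P is the (m, m)-transpose permutation and n = m * m.

    Each of the 2m block matvecs costs m^2, so the total is O(n^1.5)."""
    m = L.shape[0]
    z = np.einsum('bij,bj->bi', R, x.reshape(m, m))  # apply R blockwise
    z = z.T                                          # permutation P (transpose)
    z = np.einsum('bij,bj->bi', L, z)                # apply L blockwise
    return z.T.reshape(m * m)                        # undo P and flatten

m = 4
L = np.random.randn(m, m, m)  # m diagonal blocks of size m x m
R = np.random.randn(m, m, m)
x = np.random.randn(m * m)
y = monarch_matvec(L, R, x)   # O(n^1.5) structured matvec, n = 16
```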
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
- Asynchronous Stochastic Optimization Robust to Arbitrary Delays [54.61797739710608]
We consider optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$, for an arbitrary delay $d_t$.
Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
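Concretely, the update takes the form $x_{t+1} = x_t - \eta \nabla f(x_{t - d_t})$. A toy sketch with a constant delay buffer, for intuition only (the paper's algorithm is designed to be robust to arbitrary delays, which this sketch is not):

```python
import numpy as np
from collections import deque

def delayed_sgd(grad, x0, lr=0.1, delay=3, steps=50):
    """Toy SGD where each update uses a gradient computed `delay` steps ago:
    x_{t+1} = x_t - lr * grad(x_{t - d_t}), here with constant d_t = delay."""
    x = x0.astype(float)
    stale = deque()                       # queue of gradients in flight
    for t in range(steps):
        stale.append(grad(x))             # gradient launched at the current point
        if len(stale) > delay:
            x = x - lr * stale.popleft()  # apply the gradient from step t - delay
    return x

# Minimize f(x) = ||x||^2 / 2, whose gradient is x.
x_final = delayed_sgd(lambda x: x, np.ones(5))
print(x_final)  # converges toward 0 despite the stale gradients
```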
arXiv Detail & Related papers (2021-06-22T15:50:45Z)
- VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average a 3712× speedup with 1301.25× energy reduction over CPUs, and a 35.4× speedup with 17.66× energy reduction over GPUs.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)