DYAD: A Descriptive Yet Abjuring Density efficient approximation to
linear neural network layers
- URL: http://arxiv.org/abs/2312.06881v1
- Date: Mon, 11 Dec 2023 23:04:48 GMT
- Title: DYAD: A Descriptive Yet Abjuring Density efficient approximation to
linear neural network layers
- Authors: Sarin Chandy, Varun Gangal, Yi Yang, Gabriel Maggiotti
- Abstract summary: We devise, implement and performance-assess DYAD, a layer which can serve as a faster and more memory-efficient approximate replacement for linear layers.
DYAD is based on a bespoke near-sparse matrix structure which approximates the dense "weight" matrix W that matrix-multiplies the input in the typical realization of such a layer, a.k.a. DENSE.
- Score: 19.949611634077634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We devise, implement and performance-assess DYAD, a layer which can serve as a
faster and more memory-efficient approximate replacement for linear layers
(nn.Linear() in PyTorch). These layers appear in common subcomponents, such as
the feed-forward (ff) module of Transformers. DYAD is based on a bespoke near-sparse
matrix structure which approximates the dense "weight" matrix W that
matrix-multiplies the input in the typical realization of such a layer, a.k.a.
DENSE. Our alternative near-sparse matrix structure is decomposable into a sum
of 2 matrices, each permutable to a block-sparse counterpart. These can be
represented as 3D tensors, which in unison allow a faster execution of matrix
multiplication with the mini-batched input matrix X compared to DENSE, reducing
the cost from O(rows(W) x cols(W)) to O(rows(W) x cols(W) / #blocks). As the
crux of our experiments, we pretrain both DYAD and DENSE variants of 2 sizes of
the OPT architecture and 1 size of the Pythia architecture, at different token
scales of the babyLM benchmark. We find DYAD to be competitive, reaching >= 90%
of DENSE performance on zero-shot (e.g. BLIMP), few-shot (OPENLM) and
finetuning (GLUE) benchmarks, while being 7-15% or more faster to train on GPU
even at the 125M scale, besides surfacing larger speedups at increasing scale
and model width.
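Below is a minimal PyTorch sketch of the cost argument above, assuming a layer built from two block-diagonal weight tensors stored as 3D tensors, with the second branch applied to a permuted view of the input. The class name, the fixed random permutation, and the initialization are illustrative assumptions, not the paper's exact DYAD construction.

```python
# Illustrative sketch only: approximate a dense nn.Linear with a sum of two
# block-diagonal maps stored as 3D tensors, so each branch's matmul cost is
# rows(W) x cols(W) / num_blocks instead of rows(W) x cols(W).
import torch
import torch.nn as nn


class BlockSparseLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, num_blocks: int):
        super().__init__()
        assert in_features % num_blocks == 0 and out_features % num_blocks == 0
        self.num_blocks = num_blocks
        bi, bo = in_features // num_blocks, out_features // num_blocks
        # Two block-diagonal "weight" tensors, each stored as a 3D tensor of
        # per-block matrices of shape (num_blocks, bi, bo).
        self.w1 = nn.Parameter(torch.randn(num_blocks, bi, bo) / bi ** 0.5)
        self.w2 = nn.Parameter(torch.randn(num_blocks, bi, bo) / bi ** 0.5)
        # Hypothetical fixed permutation so the second branch mixes a different
        # grouping of input features (stand-in for "permutable to block-sparse").
        self.register_buffer("perm", torch.randperm(in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.shape[0]                                      # x: (batch, in_features)
        xb1 = x.view(b, self.num_blocks, -1)                # (batch, blocks, bi)
        xb2 = x[:, self.perm].view(b, self.num_blocks, -1)  # permuted grouping
        # Batched per-block matmuls instead of one dense matmul.
        y1 = torch.einsum("bni,nio->bno", xb1, self.w1)
        y2 = torch.einsum("bni,nio->bno", xb2, self.w2)
        return (y1 + y2).reshape(b, -1) + self.bias


# Shape-compatible drop-in for nn.Linear(1024, 1024):
layer = BlockSparseLinearSketch(1024, 1024, num_blocks=8)
print(layer(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```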
Related papers
- Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization [0.0]
We present Double Sparse Factorization (DSF), where we factorize each weight matrix into two sparse matrices.
Our method achieves state-of-the-art results, enabling unprecedented sparsification of neural networks (a minimal sketch of the two-sparse-factor idea appears after this list).
arXiv Detail & Related papers (2024-09-27T15:48:39Z)
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- Multi-level projection with exponential parallel speedup; Application to sparse auto-encoders neural networks [2.264332709661011]
We show that the time complexity for the $\ell_{1,\infty}$ norm is only $\mathcal{O}\big(n m\big)$ for a matrix in $\mathbb{R}^{n \times m}$.
Experiments show that our projection is 2 times faster than the currently fastest Euclidean algorithms.
arXiv Detail & Related papers (2024-05-03T13:21:49Z)
- PopSparse: Accelerated block sparse matrix multiplication on IPU [0.5661403709207713]
We introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs.
We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run.
Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU at a range of sparsity levels.
arXiv Detail & Related papers (2023-03-29T20:00:19Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
- Fast Differentiable Matrix Square Root and Inverse Square Root [65.67315418971688]
We propose two more efficient variants to compute the differentiable matrix square root and the inverse square root.
For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Padé Approximants (MPA).
A series of numerical tests show that both methods yield considerable speed-up compared with the SVD or the NS iteration.
arXiv Detail & Related papers (2022-01-29T10:00:35Z)
- Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z)
- VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average 3712× speedup with 1301.25× energy reduction on CPU, and 35.4× speedup with 17.66× energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
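As a quick illustration of the two-sparse-factor theme running through several of the related papers above (Double Sparse Factorization, Monarch, Compute Better Spent), here is a minimal PyTorch sketch that stores a weight as a product of two sparse factors and applies them in sequence. The random masks and the 25% density are assumptions for illustration only; none of the cited methods choose their factors this way, and real wall-clock gains require sparse or block-sparse kernels (as in PopSparse) rather than dense-stored masked tensors.

```python
# Illustrative only: replace a dense weight W with W ~= A @ B, where A and B
# are sparse, and apply the layer as two matmuls.
import torch

d = 1024
density = 0.25  # arbitrary illustrative density for each factor

# Randomly masked factors standing in for learned/optimized sparse factors.
A = torch.randn(d, d) * (torch.rand(d, d) < density)
B = torch.randn(d, d) * (torch.rand(d, d) < density)

x = torch.randn(8, d)      # mini-batch of inputs
y = (x @ B.T) @ A.T        # equivalent to applying W ~= A @ B to each row

dense_params = d * d
factor_nonzeros = int((A != 0).sum() + (B != 0).sum())
print(f"dense parameters: {dense_params}")                        # 1048576
print(f"nonzeros across both sparse factors: {factor_nonzeros}")  # ~524288 in expectation
```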
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.