A matrix math facility for Power ISA(TM) processors
- URL: http://arxiv.org/abs/2104.03142v1
- Date: Wed, 7 Apr 2021 14:17:32 GMT
- Title: A matrix math facility for Power ISA(TM) processors
- Authors: José E. Moreira, Kit Barton, Steven Battle, Peter Bergner, Ramon
Bertran, Puneeth Bhat, Pedro Caldeira, David Edelsohn, Gordon Fossum, Brad
Frey, Nemanja Ivanovic, Chip Kerchner, Vincent Lim, Shakti Kapoor, Tulio
Machado Filho, Silvia Melitta Mueller, Brett Olsson, Satish Sadasivam,
Baptiste Saleil, Bill Schmidt, Rajalakshmi Srinivasaraghavan, Shricharan
Srivatsan, Brian Thompto, Andreas Wagner, Nelson Wu
- Abstract summary: A new family of matrix math instructions, collectively known as the Matrix-Multiply Assist facility, has been introduced in Power ISA(TM) Version 3.1.
These instructions have led to a power- and area-efficient implementation of a high throughput math engine in the future POWER10 processor.
Performance per core is 4 times better, at constant frequency, than the previous generation POWER9 processor.
- Score: 0.16910097443356495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Power ISA(TM) Version 3.1 has introduced a new family of matrix math
instructions, collectively known as the Matrix-Multiply Assist (MMA) facility.
The instructions in this facility implement numerical linear algebra operations
on small matrices and are meant to accelerate computation-intensive kernels,
such as matrix multiplication, convolution and discrete Fourier transform.
These instructions have led to a power- and area-efficient implementation of a
high throughput math engine in the future POWER10 processor. Performance per
core is 4 times better, at constant frequency, than the previous generation
POWER9 processor. We also advocate the use of compiler built-ins as the
preferred way of leveraging these instructions, which we illustrate through
case studies covering matrix multiplication and convolution.
Related papers
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra [62.37017125812101]
We propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA.
By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms.
We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.
arXiv Detail & Related papers (2023-09-06T14:59:38Z)
- AMULET: Adaptive Matrix-Multiplication-Like Tasks [6.094431019524036]
We extend an open-source compiler to recognize and optimize matrix multiplication-like tasks.
Our framework, called Amulet, uses both database-style and compiler optimization techniques.
Amulet typically performs within 15% of hand-tuned matrix multiplication libraries, while handling a much broader class of computations.
arXiv Detail & Related papers (2023-05-12T17:04:24Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method tailored to computer vision application scenarios.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- Large Scale Distributed Linear Algebra With Tensor Processing Units [0.0]
We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers.
The matrix-multiply units (MXUs) dominate the runtime, yielding impressive scaling, performance, and raw size.
arXiv Detail & Related papers (2021-12-16T16:55:22Z)
- A Deep Learning Inference Scheme Based on Pipelined Matrix Multiplication Acceleration Design and Non-uniform Quantization [9.454905560571085]
We introduce a low-power Multi-layer Perceptron (MLP) accelerator based on a pipelined matrix multiplication scheme and a nonuniform quantization methodology.
Results show that our method achieves better performance with lower power consumption.
arXiv Detail & Related papers (2021-10-10T17:31:27Z)
- Robust 1-bit Compressive Sensing with Partial Gaussian Circulant Matrices and Generative Priors [54.936314353063494]
We provide recovery guarantees for a correlation-based optimization algorithm for robust 1-bit compressive sensing.
We make use of a practical iterative algorithm, and perform numerical experiments on image datasets to corroborate our results.
arXiv Detail & Related papers (2021-08-08T05:28:06Z)
- Multiplying Matrices Without Multiplying [0.0]
Multiplying matrices is among the most fundamental and compute-intensive operations in machine learning.
We introduce a learning-based algorithm for this task that greatly outperforms existing methods.
arXiv Detail & Related papers (2021-06-21T05:08:54Z)
- Non-PSD Matrix Sketching with Applications to Regression and Optimization [56.730993511802865]
We present dimensionality reduction methods for non-PSD matrices and their "square-roots".
We show how these techniques can be used for multiple downstream tasks.
arXiv Detail & Related papers (2021-06-16T04:07:48Z)
- Direct Spatial Implementation of Sparse Matrix Multipliers for Reservoir Computing [0.0]
Reservoir computing systems rely on the recurrent multiplication of a very large, sparse, fixed matrix.
We argue that direct implementation of these fixed matrices minimizes the work performed in the computation.
We present the structure of our bit-serial matrix multiplier, and evaluate using canonical signed digit representation to further reduce logic utilization.
arXiv Detail & Related papers (2021-01-21T23:16:22Z)
- What if Neural Networks had SVDs? [66.91160214071088]
Various neural networks employ time-consuming matrix operations like matrix inversion.
We present an algorithm that is fast enough to speed up several matrix operations.
arXiv Detail & Related papers (2020-09-29T12:58:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.