A matrix math facility for Power ISA(TM) processors
- URL: http://arxiv.org/abs/2104.03142v1
- Date: Wed, 7 Apr 2021 14:17:32 GMT
- Title: A matrix math facility for Power ISA(TM) processors
- Authors: José E. Moreira, Kit Barton, Steven Battle, Peter Bergner, Ramon
Bertran, Puneeth Bhat, Pedro Caldeira, David Edelsohn, Gordon Fossum, Brad
Frey, Nemanja Ivanovic, Chip Kerchner, Vincent Lim, Shakti Kapoor, Tulio
Machado Filho, Silvia Melitta Mueller, Brett Olsson, Satish Sadasivam,
Baptiste Saleil, Bill Schmidt, Rajalakshmi Srinivasaraghavan, Shricharan
Srivatsan, Brian Thompto, Andreas Wagner, Nelson Wu
- Abstract summary: A new family of matrix math instructions, collectively known as the Matrix-Multiply Assist facility, has been introduced in Power ISA(TM) Version 3.1.
These instructions have led to a power- and area-efficient implementation of a high throughput math engine in the future POWER10 processor.
Performance per core is 4 times better, at constant frequency, than the previous generation POWER9 processor.
- Score: 0.16910097443356495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Power ISA(TM) Version 3.1 has introduced a new family of matrix math
instructions, collectively known as the Matrix-Multiply Assist (MMA) facility.
The instructions in this facility implement numerical linear algebra operations
on small matrices and are meant to accelerate computation-intensive kernels,
such as matrix multiplication, convolution and discrete Fourier transform.
These instructions have led to a power- and area-efficient implementation of a
high throughput math engine in the future POWER10 processor. Performance per
core is 4 times better, at constant frequency, than the previous generation
POWER9 processor. We also advocate the use of compiler built-ins as the
preferred way of leveraging these instructions, which we illustrate through
case studies covering matrix multiplication and convolution.
Related papers
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra [62.37017125812101]
We propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA.
By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms.
We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.
arXiv Detail & Related papers (2023-09-06T14:59:38Z)
- AMULET: Adaptive Matrix-Multiplication-Like Tasks [6.094431019524036]
We extend an open-source compiler to recognize and optimize matrix multiplication-like tasks.
Our framework, called Amulet, uses both database-style and compiler optimization techniques.
Amulet typically performs within 15% of hand-tuned matrix multiplication libraries, while handling a much broader class of computations.
arXiv Detail & Related papers (2023-05-12T17:04:24Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method tailored to computer vision application scenarios.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- Large Scale Distributed Linear Algebra With Tensor Processing Units [0.0]
We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers.
The matrix-multiply units (MXUs) dominate the runtime, yielding impressive scaling, performance, and raw size.
arXiv Detail & Related papers (2021-12-16T16:55:22Z)
- A Deep Learning Inference Scheme Based on Pipelined Matrix Multiplication Acceleration Design and Non-uniform Quantization [9.454905560571085]
We introduce a low-power Multi-layer Perceptron (MLP) accelerator based on a pipelined matrix multiplication scheme and a nonuniform quantization methodology.
Results show that our method achieves better performance with lower power consumption.
arXiv Detail & Related papers (2021-10-10T17:31:27Z)
- Robust 1-bit Compressive Sensing with Partial Gaussian Circulant Matrices and Generative Priors [54.936314353063494]
We provide recovery guarantees for a correlation-based optimization algorithm for robust 1-bit compressive sensing.
We make use of a practical iterative algorithm, and perform numerical experiments on image datasets to corroborate our results.
arXiv Detail & Related papers (2021-08-08T05:28:06Z)
- Multiplying Matrices Without Multiplying [0.0]
Multiplying matrices is among the most fundamental and compute-intensive operations in machine learning.
We introduce a learning-based algorithm for this task that greatly outperforms existing methods.
arXiv Detail & Related papers (2021-06-21T05:08:54Z)
- Non-PSD Matrix Sketching with Applications to Regression and Optimization [56.730993511802865]
We present dimensionality reduction methods for non-PSD matrices and their "square-roots".
We show how these techniques can be used for multiple downstream tasks.
arXiv Detail & Related papers (2021-06-16T04:07:48Z)
- Direct Spatial Implementation of Sparse Matrix Multipliers for Reservoir Computing [0.0]
Reservoir computing systems rely on the recurrent multiplication of a very large, sparse, fixed matrix.
We argue that direct implementation of these fixed matrices minimizes the work performed in the computation.
We present the structure of our bit-serial matrix multiplier, and evaluate using canonical signed digit representation to further reduce logic utilization.
arXiv Detail & Related papers (2021-01-21T23:16:22Z)
- What if Neural Networks had SVDs? [66.91160214071088]
Various neural networks employ time-consuming matrix operations like matrix inversion.
We present an algorithm that is fast enough to speed up several matrix operations.
arXiv Detail & Related papers (2020-09-29T12:58:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.