Large Scale Distributed Linear Algebra With Tensor Processing Units
- URL: http://arxiv.org/abs/2112.09017v1
- Date: Thu, 16 Dec 2021 16:55:22 GMT
- Title: Large Scale Distributed Linear Algebra With Tensor Processing Units
- Authors: Adam G.M. Lewis, Jackson Beall, Martin Ganahl, Markus Hauru, Shrestha
Basu Mallick, and Guifre Vidal
- Abstract summary: We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers.
The matrix-multiply units (MXUs) dominate the runtime, yielding impressive scaling, performance, and raw size.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We have repurposed Google Tensor Processing Units (TPUs),
application-specific chips developed for machine learning, into large-scale
dense linear algebra supercomputers. The TPUs' fast inter-core interconnects
(ICIs), physically two-dimensional network topology, and high-bandwidth memory
(HBM) permit distributed matrix multiplication algorithms to rapidly become
computationally bound. In this regime, the matrix-multiply units (MXUs)
dominate the runtime, yielding impressive scaling, performance, and raw size:
operating in float32 precision, a full 2048-core pod of third generation TPUs
can multiply two matrices with linear size $N = 2^{20} = 1\,048\,576$ in about 2
minutes. Via curated algorithms emphasizing large, single-core matrix
multiplications, other tasks in dense linear algebra can similarly scale. As
examples, we present (i) QR decomposition; (ii) resolution of linear systems;
and (iii) the computation of matrix functions by polynomial iteration,
demonstrated by the matrix polar factorization.
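As a concrete illustration of item (iii), here is a minimal single-device JAX sketch of polar factorization by polynomial iteration, assuming the standard cubic Newton-Schulz scheme; the paper's actual implementation distributes these matrix multiplications across a TPU pod, and the normalization and step count below are illustrative choices, not the authors' code.

```python
# Minimal sketch: polar factorization via Newton-Schulz polynomial iteration.
# Every step is a pair of large matmuls, the regime in which MXUs dominate.
import jax
import jax.numpy as jnp

def newton_schulz_polar(a, steps=60):
    """Approximate the orthogonal polar factor U of A = U P via
    X_{k+1} = X_k (3 I - X_k^T X_k) / 2, convergent when the singular
    values of X_0 lie in (0, sqrt(3))."""
    n = a.shape[-1]
    x = a / jnp.linalg.norm(a)  # Frobenius normalization: sigma_max(x) <= 1
    eye = jnp.eye(n, dtype=a.dtype)
    for _ in range(steps):
        x = 0.5 * x @ (3.0 * eye - x.T @ x)
    return x

a = jax.random.normal(jax.random.PRNGKey(0), (256, 256), dtype=jnp.float32)
u = newton_schulz_polar(a)
# Orthogonality error shrinks with more steps (slowly for ill-conditioned A).
print(jnp.linalg.norm(u.T @ u - jnp.eye(256)))
```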
Related papers
- Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA [10.630802853096462]
Modern Neural Network (NN) architectures heavily rely on vast numbers of multiply-accumulate arithmetic operations.
This paper proposes a high-throughput, scalable, and energy-efficient non-element-wise matrix multiplication unit on FPGAs.
Our AMU achieves up to 9x higher throughput and 112x higher energy efficiency than state-of-the-art solutions for FPGA-based Quantised Neural Network (QNN) accelerators.
arXiv Detail & Related papers (2024-07-02T15:28:10Z) - Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks.
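For intuition, a Monarch-style multiply can be sketched as two block-diagonal factors separated by a grid-transpose permutation; the shapes and names below are illustrative assumptions, not the paper's BTT parameterization.

```python
# Toy Monarch-style structured matvec: two block-diagonal factors around a
# permutation give O(n^1.5) work and parameters versus O(n^2) for dense.
import jax
import jax.numpy as jnp

def monarch_like_matvec(x, left_blocks, right_blocks):
    # x: (n,) with n = b * b; *_blocks: (b, b, b) stacks of b x b blocks.
    b = left_blocks.shape[0]
    y = x.reshape(b, b)
    y = jnp.einsum('kij,kj->ki', right_blocks, y)  # block-diagonal factor R
    y = y.T                                        # permutation: grid transpose
    y = jnp.einsum('kij,kj->ki', left_blocks, y)   # block-diagonal factor L
    return y.reshape(-1)

b = 16                                             # n = b * b = 256
kl, kr, kx = jax.random.split(jax.random.PRNGKey(0), 3)
y = monarch_like_matvec(jax.random.normal(kx, (b * b,)),
                        jax.random.normal(kl, (b, b, b)),
                        jax.random.normal(kr, (b, b, b)))
print(y.shape)                                     # (256,)
```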
arXiv Detail & Related papers (2024-06-10T13:25:43Z) - An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions, largely because it must form large covariance matrices.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z) - CoLA: Exploiting Compositional Structure for Automatic and Efficient
Numerical Linear Algebra [62.37017125812101]
We propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA.
By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms.
We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.
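The core idea, linear operators composed symbolically with algorithms dispatched on structure, fits in a few lines; the class names and dispatch rule below are invented for illustration and are not CoLA's actual API.

```python
# Hypothetical mini-version of compositional dispatch (not CoLA's API):
# operators expose matvec only, and the solver picks a rule from structure.
import jax.numpy as jnp

class Diagonal:
    def __init__(self, d): self.d = d
    def matvec(self, v): return self.d * v            # O(n), never densified

class Sum:
    def __init__(self, a, b): self.a, self.b = a, b
    def matvec(self, v): return self.a.matvec(v) + self.b.matvec(v)

def solve(op, b, iters=300, step=0.1):
    if isinstance(op, Diagonal):                      # structure-aware rule
        return b / op.d
    x = jnp.zeros_like(b)                             # generic fallback:
    for _ in range(iters):                            # matrix-free Richardson
        x = x + step * (b - op.matvec(x))             # iteration (toy choice)
    return x

d = Diagonal(jnp.array([2.0, 4.0]))
print(solve(d, jnp.array([2.0, 8.0])))                # exact: [1. 2.]
s = Sum(d, Diagonal(jnp.array([1.0, 1.0])))
print(solve(s, jnp.array([3.0, 10.0])))               # iterative: ~[1. 2.]
```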
arXiv Detail & Related papers (2023-09-06T14:59:38Z) - Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
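As background, the textbook (unshifted, single-matrix) QR iteration that such methods build on looks like this; the real method's shifts and batching are omitted, so this is only a conceptual sketch.

```python
# Textbook unshifted QR iteration for a symmetric matrix (conceptual
# baseline only): A_{k+1} = R_k Q_k is a similarity transform, so the
# iterates keep the eigenvalues while converging toward diagonal form.
import jax
import jax.numpy as jnp

def qr_eig(a, iters=300):
    vecs = jnp.eye(a.shape[0], dtype=a.dtype)
    for _ in range(iters):
        q, r = jnp.linalg.qr(a)
        a = r @ q                    # similarity transform Q^T A Q
        vecs = vecs @ q              # accumulate eigenvectors
    return jnp.diag(a), vecs

m = jax.random.normal(jax.random.PRNGKey(0), (8, 8))
sym = (m + m.T) / 2
vals, _ = qr_eig(sym)
print(jnp.sort(vals))                # approximately jnp.linalg.eigvalsh(sym)
```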
arXiv Detail & Related papers (2022-07-09T09:14:12Z) - High-Dimensional Sparse Bayesian Learning without Covariance Matrices [66.60078365202867]
We introduce a new inference scheme that avoids explicit construction of the covariance matrix.
Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm.
On several simulations, our method scales better than existing approaches in computation time and memory.
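One standard way to couple those two ingredients, sketched here as a guess at the general recipe rather than the paper's exact scheme, is a stochastic diagonal estimator whose linear solves run matrix-free conjugate gradients.

```python
# Sketch (general recipe, not the paper's scheme): estimate diag(A^{-1})
# as the mean over Rademacher probes z of z * (A^{-1} z), with each solve
# done by matrix-free conjugate gradients so A is never formed explicitly.
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def estimate_inv_diag(matvec, n, key, num_probes=64):
    z = jax.random.rademacher(key, (num_probes, n), dtype=jnp.float32)
    def probe(zi):
        x, _ = cg(matvec, zi)        # solve A x = z, matrix-free
        return zi * x
    return jnp.mean(jax.vmap(probe)(z), axis=0)

n = 100
m = jax.random.normal(jax.random.PRNGKey(0), (n, n))
a = m @ m.T / n + jnp.eye(n)         # SPD test matrix
d_est = estimate_inv_diag(lambda v: a @ v, n, jax.random.PRNGKey(1))
# Rough estimate; the error shrinks as num_probes grows.
print(jnp.mean(jnp.abs(d_est - jnp.diag(jnp.linalg.inv(a)))))
```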
arXiv Detail & Related papers (2022-02-25T16:35:26Z) - A Deep Learning Inference Scheme Based on Pipelined Matrix
Multiplication Acceleration Design and Non-uniform Quantization [9.454905560571085]
We introduce a low-power Multi-layer Perceptron (MLP) accelerator based on a pipelined matrix multiplication scheme and a nonuniform quantization methodology.
Results show that our method can achieve better performance with lower power consumption.
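One simple flavor of nonuniform quantization, assumed here purely for illustration (the paper's actual scheme may differ), snaps weights to signed powers of two so a hardware multiply reduces to a shift.

```python
# Toy nonuniform (logarithmic) quantizer: weights snap to signed powers of
# two, so multiplication becomes a shift. Illustrative assumption only.
import jax.numpy as jnp

def log_quantize(w, bits=4):
    sign = jnp.sign(w)
    exp = jnp.round(jnp.log2(jnp.abs(w) + 1e-12))      # nearest power of two
    exp = jnp.clip(exp, -(2 ** (bits - 1)), 0)         # representable range
    return sign * 2.0 ** exp

w = jnp.array([0.3, -0.04, 0.9])
print(log_quantize(w))                                 # [0.25 -0.03125 1.0]
```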
arXiv Detail & Related papers (2021-10-10T17:31:27Z) - Efficient GPU implementation of randomized SVD and its applications [17.71779625877989]
Matrix decompositions are ubiquitous in machine learning, with applications in dimensionality reduction, data compression, and deep learning algorithms.
Typical solutions for matrix decompositions have polynomial complexity, which significantly increases their computational cost and time.
We leverage efficient processing operations that can be run in parallel on modern Graphical Processing Units (GPUs) to reduce the computational burden of computing matrix decompositions.
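The underlying recipe (Halko et al.) that such GPU implementations accelerate fits in a few lines; this generic sketch is not the paper's optimized kernel.

```python
# Generic randomized SVD (Halko et al.): sketch the range of A with a
# random projection, orthonormalize, then take an exact SVD of a small factor.
import jax
import jax.numpy as jnp

def randomized_svd(a, rank, key, oversample=10):
    n = a.shape[1]
    omega = jax.random.normal(key, (n, rank + oversample))
    q, _ = jnp.linalg.qr(a @ omega)          # orthonormal basis for range(A)
    u_small, s, vt = jnp.linalg.svd(q.T @ a, full_matrices=False)
    return q @ u_small[:, :rank], s[:rank], vt[:rank]

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
a = jax.random.normal(k1, (500, 80)) @ jax.random.normal(k2, (80, 300))
u, s, vt = randomized_svd(a, rank=80, key=k3)
print(jnp.linalg.norm(a - (u * s) @ vt) / jnp.linalg.norm(a))  # ~0
```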
arXiv Detail & Related papers (2021-10-05T07:42:41Z) - Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via
GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
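The unfolding idea itself can be shown generically, with a toy model-based solver whose per-iteration step sizes become learnable layer parameters; this is the general pattern, not the paper's GDPA-based network.

```python
# Generic algorithm unfolding (the pattern, not the paper's GDPA network):
# each layer is one iteration of a model-based solver, here gradient descent
# on the quadratic 0.5 x^T A x - b^T x, with a learnable step per layer.
import jax.numpy as jnp

def unrolled_solver(step_sizes, b, matvec):
    x = jnp.zeros_like(b)
    for alpha in step_sizes:                 # one entry per "layer"
        x = x - alpha * (matvec(x) - b)      # one unrolled iteration
    return x

a = jnp.array([[2.0, 0.0], [0.0, 4.0]])
steps = jnp.full((20,), 0.2)                 # in practice learned by backprop
print(unrolled_solver(steps, jnp.array([2.0, 4.0]), lambda v: a @ v))  # ~[1. 1.]
```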
arXiv Detail & Related papers (2021-09-10T07:01:15Z) - A matrix math facility for Power ISA(TM) processors [0.16910097443356495]
A new family of matrix math instructions, collectively known as the Matrix-Multiply Assist facility, has been introduced in Power ISA(TM) Version 3.1.
These instructions have led to a power- and area-efficient implementation of a high throughput math engine in the future POWER10 processor.
Performance per core is 4 times better, at constant frequency, than that of the previous-generation POWER9 processor.
arXiv Detail & Related papers (2021-04-07T14:17:32Z) - Direct Spatial Implementation of Sparse Matrix Multipliers for Reservoir
Computing [0.0]
Reservoir computing systems rely on the recurrent multiplication of a very large, sparse, fixed matrix.
We argue that direct implementation of these fixed matrices minimizes the work performed in the computation.
We present the structure of our bit-serial matrix multiplier and evaluate the use of canonical signed digit representation to further reduce logic utilization.
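Canonical signed digit recoding itself is compact; a plain-Python sketch of the standard non-adjacent-form conversion is below (it illustrates why CSD reduces adder count, not how the hardware is laid out).

```python
# Canonical signed digit (non-adjacent form) recoding: digits in {-1, 0, 1}
# with no two adjacent nonzeros, minimizing nonzero digits and hence the
# shift-and-add terms a bit-serial multiplier must accumulate.
def csd_digits(k):
    digits = []
    while k != 0:
        if k % 2:
            d = 2 - (k % 4)          # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits                    # least-significant digit first

print(csd_digits(7))                 # [-1, 0, 0, 1]: 7 = 8 - 1, two nonzeros
print(csd_digits(15))                # [-1, 0, 0, 0, 1]: 15 = 16 - 1
```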
arXiv Detail & Related papers (2021-01-21T23:16:22Z)