RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU
        - URL: http://arxiv.org/abs/2110.01752v1
 - Date: Tue, 5 Oct 2021 00:01:31 GMT
 - Title: RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU
 - Authors: Geonhwa Jeong, Eric Qin, Ananda Samajdar, Christopher J. Hughes,
  Sreenivas Subramoney, Hyesoon Kim, Tushar Krishna
 - Abstract summary: We propose RASA, Register-Aware Systolic Array.
We develop techniques to divide an execution stage into several sub-stages and overlap instructions to hide overheads and run them concurrently.
 RASA-based designs improve performance significantly with negligible area and power overhead.
 - Score: 6.436294460697506
 - License: http://creativecommons.org/licenses/by-nc-nd/4.0/
 - Abstract:   As AI-based applications become pervasive, CPU vendors are starting to
incorporate matrix engines within the datapath to boost efficiency. Systolic
arrays have been the premier architectural choice as matrix engines in offload
accelerators. However, we demonstrate that incorporating them inside CPUs can
introduce under-utilization and stalls due to limited register storage to
amortize the fill and drain times of the array. To address this, we propose
RASA, Register-Aware Systolic Array. We develop techniques to divide an
execution stage into several sub-stages and overlap instructions to hide
overheads and run them concurrently. RASA-based designs improve performance
significantly with negligible area and power overhead.
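The trade-off described above can be illustrated with a toy cycle-count model: when the fill, compute, and drain sub-stages of consecutive matrix instructions are allowed to overlap, the fill and drain overheads are paid roughly once instead of once per instruction. The sub-stage split and the latency values in the sketch below are illustrative assumptions, not the parameters of the actual RASA pipeline.

```python
# Toy cycle-count model contrasting back-to-back matrix instructions with
# instructions whose fill, compute, and drain sub-stages overlap.
# All latencies are illustrative assumptions, not RASA's actual parameters.

FILL = 16     # cycles to stream operands into the array (assumed)
COMPUTE = 32  # cycles of useful MAC work per instruction (assumed)
DRAIN = 16    # cycles to move results back to registers (assumed)

def sequential_cycles(n_instructions: int) -> int:
    """Each instruction occupies the array for fill + compute + drain."""
    return n_instructions * (FILL + COMPUTE + DRAIN)

def overlapped_cycles(n_instructions: int) -> int:
    """Sub-stages of consecutive instructions overlap: while one
    instruction drains, the next is already filling, so after the first
    fill only the compute sub-stages serialize."""
    if n_instructions == 0:
        return 0
    return FILL + n_instructions * COMPUTE + DRAIN

if __name__ == "__main__":
    for n in (1, 4, 16):
        seq = sequential_cycles(n)
        ovl = overlapped_cycles(n)
        print(f"{n:2d} instrs: sequential={seq:4d}  overlapped={ovl:4d}  "
              f"speedup={seq / ovl:.2f}x")
```

With these assumed latencies, 16 back-to-back instructions take 544 cycles overlapped versus 1024 sequential, roughly a 1.9x improvement; how many instructions can actually be kept in flight depends on the available register storage, which is the constraint the paper targets.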
 
       
      
        Related papers
        - SystolicAttention: Fusing FlashAttention within a Single Systolic Array [2.8650887057567864]
Transformer models rely heavily on scaled dot-product attention (SDPA). Current systolic-array-based accelerators face significant challenges when executing FlashAttention. We propose FSA, an enhanced systolic array architecture that enables the entire FlashAttention algorithm to run within a single systolic array.
arXiv  Detail & Related papers  (2025-07-15T14:04:17Z) - Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of the softmax operation.
To address this, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.
Experiments on eight benchmarks demonstrate that SWAT outperforms state-of-the-art linear recurrent architectures.
arXiv  Detail & Related papers  (2025-02-26T05:31:44Z) - SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered   CPUs [5.760049762453579]
Accelerating large language models with CPUs enables broader AI access at a lower cost and power consumption.
We provide a set of open-source customized sparse kernels that can speed up any PyTorch model.
We demonstrate for the first time the use of unstructured sparsity in attention, achieving a $1.14\times$ speedup over current systems.
arXiv  Detail & Related papers  (2025-02-18T02:26:34Z) - COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array   Based In-Memory Deep Learning Accelerators [6.172271429579593]
We propose a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators.
We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip.
arXiv  Detail & Related papers  (2025-01-12T11:31:25Z) - Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv  Detail & Related papers  (2024-09-25T21:32:12Z) - TrIM: Triangular Input Movement Systolic Array for Convolutional Neural   Networks -- Part II: Architecture and Hardware Implementation [0.0]
TrIM is an innovative dataflow based on a triangular movement of inputs.
 TrIM can reduce the number of memory accesses by one order of magnitude when compared to state-of-the-art systolic arrays.
 The proposed architecture achieves a peak throughput of 453.6 Giga Operations per Second (GOPS).
arXiv  Detail & Related papers  (2024-08-05T10:18:00Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor   Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs), and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv  Detail & Related papers  (2023-04-25T05:04:44Z) - PopSparse: Accelerated block sparse matrix multiplication on IPU [0.5661403709207713]
We introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs.
We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run.
Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU at a range of sparsity levels.
arXiv  Detail & Related papers  (2023-03-29T20:00:19Z) - Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv  Detail & Related papers  (2023-03-14T15:51:35Z) - VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs [7.807134159136234]
This work presents VEGETA, a set of ISA and microarchitecture extensions over dense matrix engines to support flexible structured sparsity for CPUs.
A VEGETA engine provides 1.09x, 2.20x, 3.74x, and 3.28x speed-ups when running 4:4 (dense), 2:4, 1:4, and unstructured sparse layers, respectively; a sketch of the N:M sparsity pattern behind these notations appears after this list.
arXiv  Detail & Related papers  (2023-02-17T04:35:58Z) - FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv  Detail & Related papers  (2022-04-22T21:57:00Z) - A Deep Learning Inference Scheme Based on Pipelined Matrix Multiplication Acceleration Design and Non-uniform Quantization [9.454905560571085]
We introduce a low-power Multi-layer Perceptron (MLP) accelerator based on a pipelined matrix multiplication scheme and a nonuniform quantization methodology.
Results show that our method can achieve better performance with lower power consumption.
arXiv  Detail & Related papers  (2021-10-10T17:31:27Z) - Direct Spatial Implementation of Sparse Matrix Multipliers for Reservoir Computing [0.0]
Reservoir computing systems rely on the recurrent multiplication of a very large, sparse, fixed matrix.
We argue that direct implementation of these fixed matrices minimizes the work performed in the computation.
We present the structure of our bit-serial matrix multiplier and evaluate the use of canonical signed digit representation to further reduce logic utilization.
arXiv  Detail & Related papers  (2020-06-02T06:44:09Z) - PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv  Detail & Related papers  (2020-06-02T06:44:09Z) - On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv  Detail & Related papers  (2020-02-15T23:25:12Z) 
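The 2:4 and 1:4 notations in the VEGETA entry above denote N:M structured sparsity, in which at most N elements out of every group of M consecutive weights are nonzero, and only those values plus their in-group positions are stored. The sketch below compresses a weight row into such a format; the magnitude-based selection and the (values, indices) layout are illustrative assumptions, not VEGETA's actual metadata encoding or microarchitecture.

```python
import numpy as np

def compress_n_m(weights: np.ndarray, n: int, m: int):
    """Keep the n largest-magnitude values in every group of m weights and
    record their in-group positions. Returns (values, indices), each of
    shape (num_groups, n). Layout is illustrative, not VEGETA's encoding."""
    assert weights.size % m == 0
    groups = weights.reshape(-1, m)
    # positions of the n largest-magnitude entries per group, kept in order
    idx = np.argsort(-np.abs(groups), axis=1)[:, :n]
    idx.sort(axis=1)
    vals = np.take_along_axis(groups, idx, axis=1)
    return vals, idx.astype(np.uint8)

if __name__ == "__main__":
    row = np.array([0.9, -0.1, 0.0, 0.4, 0.2, -0.7, 0.05, 0.3], dtype=np.float32)
    for n in (4, 2, 1):  # 4:4 (dense), 2:4, 1:4 as in the VEGETA entry
        vals, idx = compress_n_m(row, n, 4)
        print(f"{n}:4  values={vals.tolist()}  indices={idx.tolist()}")
```

Running this on the example row shows how 2:4 halves the stored values and 1:4 keeps a single value per group of four; the fixed group width is what lets such patterns map onto a matrix engine with a regular datapath, in contrast to the unstructured case.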