RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU
        - URL: http://arxiv.org/abs/2110.01752v1
 - Date: Tue, 5 Oct 2021 00:01:31 GMT
 - Title: RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU
 - Authors: Geonhwa Jeong, Eric Qin, Ananda Samajdar, Christopher J. Hughes,
  Sreenivas Subramoney, Hyesoon Kim, Tushar Krishna
 - Abstract summary: We propose RASA, Register-Aware Systolic Array.
We develop techniques to divide an execution stage into several sub-stages and overlap instructions to hide overheads and run them concurrently.
 RASA-based designs improve performance significantly with negligible area and power overhead.
 - Score: 6.436294460697506
 - License: http://creativecommons.org/licenses/by-nc-nd/4.0/
 - Abstract:   As AI-based applications become pervasive, CPU vendors are starting to
incorporate matrix engines within the datapath to boost efficiency. Systolic
arrays have been the premier architectural choice as matrix engines in offload
accelerators. However, we demonstrate that incorporating them inside CPUs can
introduce under-utilization and stalls due to limited register storage to
amortize the fill and drain times of the array. To address this, we propose
RASA, Register-Aware Systolic Array. We develop techniques to divide an
execution stage into several sub-stages and overlap instructions to hide
overheads and run them concurrently. RASA-based designs improve performance
significantly with negligible area and power overhead.
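The trade-off described above can be illustrated with a toy cycle-count model: when the fill, compute, and drain sub-stages of consecutive matrix instructions are allowed to overlap, the fill and drain overheads are paid roughly once instead of once per instruction. The sub-stage split and the latency values in the sketch below are illustrative assumptions, not the parameters of the actual RASA pipeline.

```python
# Toy cycle-count model contrasting back-to-back matrix instructions with
# instructions whose fill, compute, and drain sub-stages overlap.
# All latencies are illustrative assumptions, not RASA's actual parameters.

FILL = 16     # cycles to stream operands into the array (assumed)
COMPUTE = 32  # cycles of useful MAC work per instruction (assumed)
DRAIN = 16    # cycles to move results back to registers (assumed)

def sequential_cycles(n_instructions: int) -> int:
    """Each instruction occupies the array for fill + compute + drain."""
    return n_instructions * (FILL + COMPUTE + DRAIN)

def overlapped_cycles(n_instructions: int) -> int:
    """Sub-stages of consecutive instructions overlap: while one
    instruction drains, the next is already filling, so after the first
    fill only the compute sub-stages serialize."""
    if n_instructions == 0:
        return 0
    return FILL + n_instructions * COMPUTE + DRAIN

if __name__ == "__main__":
    for n in (1, 4, 16):
        seq = sequential_cycles(n)
        ovl = overlapped_cycles(n)
        print(f"{n:2d} instrs: sequential={seq:4d}  overlapped={ovl:4d}  "
              f"speedup={seq / ovl:.2f}x")
```

With these assumed latencies, 16 back-to-back instructions take 544 cycles overlapped versus 1024 sequential, roughly a 1.9x improvement; how many instructions can actually be kept in flight depends on the available register storage, which is the constraint the paper targets.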
 
       
      
        Related papers
        - SystolicAttention: Fusing FlashAttention within a Single Systolic Array [2.8650887057567864]
Transformer models rely heavily on scaled dot-product attention (SDPA). Current systolic-array-based accelerators face significant challenges when executing FlashAttention. We propose FSA, an enhanced systolic array architecture that enables the entire FlashAttention algorithm to run within a single systolic array.
arXiv  Detail & Related papers  (2025-07-15T14:04:17Z) - Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of the softmax operation.
To address this, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.
Experiments on eight benchmarks demonstrate that SWAT outperforms state-of-the-art linear recurrent architectures.
arXiv  Detail & Related papers  (2025-02-26T05:31:44Z) - SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered   CPUs [5.760049762453579]
Accelerating large language models with CPUs enables broader AI access at a lower cost and power consumption.
We provide a set of open-source customized sparse kernels that can speed up any PyTorch model.
We demonstrate for the first time the use of unstructured sparsity in attention, achieving a $1.14\times$ speedup over current systems.
arXiv  Detail & Related papers  (2025-02-18T02:26:34Z) - COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array   Based In-Memory Deep Learning Accelerators [6.172271429579593]
We propose a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators.
We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip.
arXiv  Detail & Related papers  (2025-01-12T11:31:25Z) - Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv  Detail & Related papers  (2024-09-25T21:32:12Z) - TrIM: Triangular Input Movement Systolic Array for Convolutional Neural   Networks -- Part II: Architecture and Hardware Implementation [0.0]
TrIM is an innovative dataflow based on a triangular movement of inputs.
 TrIM can reduce the number of memory accesses by one order of magnitude when compared to state-of-the-art systolic arrays.
 The proposed architecture achieves a peak throughput of 453.6 Giga Operations per Second (GOPS).
arXiv  Detail & Related papers  (2024-08-05T10:18:00Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor   Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs), and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv  Detail & Related papers  (2023-04-25T05:04:44Z) - PopSparse: Accelerated block sparse matrix multiplication on IPU [0.5661403709207713]
We introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs.
We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run.
Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU at a range of sparsity levels.
arXiv  Detail & Related papers  (2023-03-29T20:00:19Z) - Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv  Detail & Related papers  (2023-03-14T15:51:35Z) - VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs [7.807134159136234]
This work presents VEGETA, a set of ISA and microarchitecture extensions over dense matrix engines to support flexible structured sparsity for CPUs.
A VEGETA engine provides 1.09x, 2.20x, 3.74x, and 3.28x speed-ups when running 4:4 (dense), 2:4, 1:4, and unstructured sparse layers, respectively; a sketch of the N:M sparsity pattern behind these notations appears after this list.
arXiv  Detail & Related papers  (2023-02-17T04:35:58Z) - FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv  Detail & Related papers  (2022-04-22T21:57:00Z) - A Deep Learning Inference Scheme Based on Pipelined Matrix Multiplication Acceleration Design and Non-uniform Quantization [9.454905560571085]
We introduce a low-power Multi-layer Perceptron (MLP) accelerator based on a pipelined matrix multiplication scheme and a nonuniform quantization methodology.
Results show that our method can achieve better performance with lower power consumption.
arXiv  Detail & Related papers  (2021-10-10T17:31:27Z) - Direct Spatial Implementation of Sparse Matrix Multipliers for Reservoir Computing [0.0]
Reservoir computing systems rely on the recurrent multiplication of a very large, sparse, fixed matrix.
We argue that direct implementation of these fixed matrices minimizes the work performed in the computation.
We present the structure of our bit-serial matrix multiplier and evaluate the use of canonical signed digit representation to further reduce logic utilization.
arXiv  Detail & Related papers  (2020-06-02T06:44:09Z) - PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv  Detail & Related papers  (2020-06-02T06:44:09Z) - On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv  Detail & Related papers  (2020-02-15T23:25:12Z) 
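The 2:4 and 1:4 notations in the VEGETA entry above denote N:M structured sparsity, in which at most N elements out of every group of M consecutive weights are nonzero, and only those values plus their in-group positions are stored. The sketch below compresses a weight row into such a format; the magnitude-based selection and the (values, indices) layout are illustrative assumptions, not VEGETA's actual metadata encoding or microarchitecture.

```python
import numpy as np

def compress_n_m(weights: np.ndarray, n: int, m: int):
    """Keep the n largest-magnitude values in every group of m weights and
    record their in-group positions. Returns (values, indices), each of
    shape (num_groups, n). Layout is illustrative, not VEGETA's encoding."""
    assert weights.size % m == 0
    groups = weights.reshape(-1, m)
    # positions of the n largest-magnitude entries per group, kept in order
    idx = np.argsort(-np.abs(groups), axis=1)[:, :n]
    idx.sort(axis=1)
    vals = np.take_along_axis(groups, idx, axis=1)
    return vals, idx.astype(np.uint8)

if __name__ == "__main__":
    row = np.array([0.9, -0.1, 0.0, 0.4, 0.2, -0.7, 0.05, 0.3], dtype=np.float32)
    for n in (4, 2, 1):  # 4:4 (dense), 2:4, 1:4 as in the VEGETA entry
        vals, idx = compress_n_m(row, n, 4)
        print(f"{n}:4  values={vals.tolist()}  indices={idx.tolist()}")
```

Running this on the example row shows how 2:4 halves the stored values and 1:4 keeps a single value per group of four; the fixed group width is what lets such patterns map onto a matrix engine with a regular datapath, in contrast to the unstructured case.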