Strassen Multisystolic Array Hardware Architectures
- URL: http://arxiv.org/abs/2502.10063v1
- Date: Fri, 14 Feb 2025 10:40:32 GMT
- Title: Strassen Multisystolic Array Hardware Architectures
- Authors: Trevor E. Pogue, Nicola Nicolici
- Abstract summary: Strassen's matrix multiplication algorithm reduces the complexity of naive matrix multiplication.
General-purpose hardware is not suitable for achieving the algorithm's promised theoretical speedups.
We present and evaluate new systolic array architectures that efficiently translate the theoretical complexity reductions of Strassen's algorithm directly into hardware resource savings.
- Abstract: While Strassen's matrix multiplication algorithm reduces the complexity of naive matrix multiplication, general-purpose hardware is not suitable for achieving the algorithm's promised theoretical speedups. This leaves open the question of whether it could be better exploited in custom hardware architectures designed specifically for executing the algorithm. However, there is limited prior work on this, and it is not immediately clear how to derive such architectures or whether they can ultimately lead to real improvements. We bridge this gap, presenting and evaluating new systolic array architectures that efficiently translate the theoretical complexity reductions of Strassen's algorithm directly into hardware resource savings. Furthermore, the architectures are multisystolic array designs that can multiply smaller matrices with higher utilization than single-systolic array designs. The proposed designs implemented on FPGA reduce DSP requirements by a factor of $1.14^r$ for $r$ implemented Strassen recursion levels, and otherwise require overall similar soft logic resources when instantiated to support matrix sizes down to $32 \times 32$ and $24 \times 24$ at 1-2 levels of Strassen recursion, respectively. We evaluate the proposed designs both in isolation and in an end-to-end machine learning accelerator compared to baseline designs and prior works, achieving state-of-the-art performance.
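For context on where the $1.14^r$ figure comes from, below is a minimal NumPy sketch of one level of Strassen's recursion. It is an illustration only, not the paper's hardware mapping: one recursion level replaces the naive 8 block multiplications with 7, and $8/7 \approx 1.14$, so $r$ levels yield roughly a $1.14^r$ reduction in multiplier (DSP) demand. The function name and the use of NumPy are our own for illustration.

```python
import numpy as np

def strassen_one_level(A, B):
    """Multiply A @ B using one level of Strassen's recursion.

    Uses 7 block multiplications instead of the naive 8; in the paper's
    architectures these block products would map onto (multi)systolic
    arrays rather than np.matmul. Assumes square matrices of even size.
    """
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # The 7 Strassen block products
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Recombination into the four output blocks (additions only)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

# Quick check against naive multiplication at the smallest supported size
A, B = np.random.rand(32, 32), np.random.rand(32, 32)
assert np.allclose(strassen_one_level(A, B), A @ B)
```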
Related papers
- Enhanced Computationally Efficient Long LoRA Inspired Perceiver Architectures for Auto-Regressive Language Modeling [2.9228447484533695]
The Transformer architecture has revolutionized the Natural Language Processing field and is the backbone of Large Language Models (LLMs).
One of the challenges in the Transformer architecture is the quadratic complexity of the attention mechanism that prohibits the efficient processing of long sequence lengths.
One of the important works in this respect is the Perceiver class of architectures that have demonstrated excellent performance while reducing the computation complexity.
arXiv Detail & Related papers (2024-12-08T23:41:38Z)
- AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation [48.82264764771652]
We introduce AsCAN -- a hybrid architecture, combining both convolutional and transformer blocks.
AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation.
We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance.
arXiv Detail & Related papers (2024-11-07T18:43:17Z)
- All-to-all reconfigurability with sparse and higher-order Ising machines [0.0]
We introduce a multiplexed architecture that emulates all-to-all network functionality.
We show that running the adaptive parallel tempering algorithm demonstrates competitive algorithmic and prefactor advantages.
Scaled magnetic versions of p-bit IMs could lead to orders-of-magnitude improvements over the state of the art for generic optimization.
arXiv Detail & Related papers (2023-11-21T20:27:02Z)
- KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition [20.592217626952507]
CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process.
This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints.
arXiv Detail & Related papers (2023-10-06T22:57:25Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- A Structured Method for Compilation of QAOA Circuits in Quantum Computing [5.560410979877026]
The flexibility in reordering the two-qubit gates allows compiler optimizations to generate circuits with better depths, gate count, and fidelity.
We propose a structured method that ensures linear depth for any compiled QAOA circuit on multi-dimensional quantum architectures.
Overall, we can compile a circuit with up to 1024 qubits in 10 seconds with a 3.8X speedup in depth, 17% reduction in gate count, and 18X improvement for circuit ESP.
arXiv Detail & Related papers (2021-12-12T04:00:45Z)
- Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks [0.38848561367220275]
Convolutional Neural Networks (CNNs) are widely used in deep learning applications such as visual systems and robotics.
Here, we present a model-independent reconfigurable co-processing architecture to accelerate CNNs.
In contrast to existing solutions, we introduce limited-precision 32-bit Q-format fixed-point quantization for arithmetic representations and operations.
arXiv Detail & Related papers (2021-08-21T09:50:54Z)
- iDARTS: Differentiable Architecture Search with Stochastic Implicit Gradients [75.41173109807735]
Differentiable ARchiTecture Search (DARTS) has recently become the mainstream of neural architecture search (NAS).
We tackle the hypergradient computation in DARTS based on the implicit function theorem.
We show that the architecture optimisation with the proposed method, named iDARTS, is expected to converge to a stationary point.
arXiv Detail & Related papers (2021-06-21T00:44:11Z)
- Towards Accurate and Compact Architectures via Neural Architecture Transformer [95.4514639013144]
It is necessary to optimize the operations inside an architecture to improve the performance without introducing extra computational cost.
We have proposed a Neural Architecture Transformer (NAT) method which casts the optimization problem into a Markov Decision Process (MDP).
We propose a Neural Architecture Transformer++ (NAT++) method which further enlarges the set of candidate transitions to improve the performance of architecture optimization.
arXiv Detail & Related papers (2021-02-20T09:38:10Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)