FlexiSAGA: A Flexible Systolic Array GEMM Accelerator for Sparse and Dense Processing
- URL: http://arxiv.org/abs/2506.01566v1
- Date: Mon, 02 Jun 2025 11:45:37 GMT
- Title: FlexiSAGA: A Flexible Systolic Array GEMM Accelerator for Sparse and Dense Processing
- Authors: Mika Markus Müller, Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Jannik Steinmetz, Oliver Bringmann
- Abstract summary: We present FlexiSAGA, an AI hardware accelerator for the sparse and dense processing of general matrix multiplications (GEMMs). We propose a DNN pruning method specifically tailored to the FlexiSAGA architecture, allowing for near-optimal processing of dense and sparse convolution and fully-connected operators. Our results show a whole-DNN sparse-over-dense inference speedup ranging from 1.41x up to 4.28x, outperforming commercial and literature-reported accelerator platforms.
- Score: 40.197673152937256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial Intelligence (AI) algorithms, such as Deep Neural Networks (DNNs), have become an important tool for a wide range of applications, from computer vision to natural language processing. However, the computational complexity of DNN inference poses a significant challenge, particularly for processing on resource-constrained edge devices. One promising approach to address this challenge is the exploitation of sparsity in DNN operator weights. In this work, we present FlexiSAGA, an architecturally configurable and dataflow-flexible AI hardware accelerator for the sparse and dense processing of general matrix multiplications (GEMMs). FlexiSAGA supports seven different sparse and dense dataflows, enabling efficient processing of resource-intensive DNN operators. Additionally, we propose a DNN pruning method specifically tailored to the FlexiSAGA architecture, allowing for near-optimal processing of dense and sparse convolution and fully-connected operators and facilitating a DNN/HW co-design flow. Our results show a whole-DNN sparse-over-dense inference speedup ranging from 1.41x up to 4.28x, outperforming commercial and literature-reported accelerator platforms.
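The core mechanism, skipping multiply-accumulates for pruned (zero) weights in a GEMM, can be illustrated with a short NumPy sketch. This is a functional model of the sparse-over-dense idea only, not FlexiSAGA's systolic dataflow; all names are illustrative.

```python
import numpy as np

def dense_gemm(W, X):
    """Reference dense GEMM: every weight participates."""
    return W @ X

def sparse_gemm(W, X):
    """Sketch of sparse processing: multiply-accumulate only the
    nonzero weights (CSR-style row traversal)."""
    M, K = W.shape
    _, N = X.shape
    Y = np.zeros((M, N))
    for m in range(M):
        nz = np.nonzero(W[m])[0]      # column indices of nonzero weights
        for k in nz:                  # zero weights are skipped entirely
            Y[m] += W[m, k] * X[k]
    return Y

# A pruned weight matrix: roughly 75% of entries are zero.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) * (rng.random((64, 64)) > 0.75)
X = rng.standard_normal((64, 16))
assert np.allclose(dense_gemm(W, X), sparse_gemm(W, X))
```

In hardware, the analogous effect is that MAC operations for zero weights are never issued, which is where sparse-over-dense speedups such as the reported 1.41x to 4.28x come from.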
Related papers
- SpikeX: Exploring Accelerator Architecture and Network-Hardware Co-Optimization for Sparse Spiking Neural Networks [3.758294848902233]
We propose a novel systolic-array SNN accelerator architecture, called SpikeX, to take on the challenges and opportunities stemming from unstructured sparsity. SpikeX reduces memory access and increases data sharing and hardware utilization, targeting computations spanning both time and space.
arXiv Detail & Related papers (2025-05-18T08:07:44Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Spyx: A Library for Just-In-Time Compiled Optimization of Spiking Neural Networks [0.08965418284317034]
Spiking Neural Networks (SNNs) promise improved energy efficiency through a reduced, low-power hardware footprint.
This paper introduces Spyx, a new and lightweight SNN simulation and optimization library designed in JAX.
arXiv Detail & Related papers (2024-02-29T09:46:44Z)
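Spyx's actual API is not given in the abstract; as a rough illustration of the kind of kernel such a library JIT-compiles, here is a minimal leaky integrate-and-fire (LIF) step in plain NumPy. The function and parameter names are hypothetical.

```python
import numpy as np

def lif_step(v, x, beta=0.9, threshold=1.0):
    """One leaky integrate-and-fire step (illustrative, not Spyx's API).
    v: membrane potentials, x: input currents, beta: leak factor."""
    v = beta * v + x                          # leaky integration
    spikes = (v >= threshold).astype(v.dtype)
    v = v - spikes * threshold                # soft reset after a spike
    return v, spikes

v = np.zeros(4)
inputs = np.array([[0.6, 1.2, 0.1, 0.9]] * 3)
for t, x in enumerate(inputs):
    v, s = lif_step(v, x)
    print(f"t={t} spikes={s}")
```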
- Accurate and Efficient Event-based Semantic Segmentation Using Adaptive Spiking Encoder-Decoder Network [20.05283214295881]
Spiking neural networks (SNNs) are emerging as promising solutions for processing dynamic, asynchronous signals from event-based sensors.
We develop an efficient spiking encoder-decoder network (SpikingEDN) for large-scale event-based semantic segmentation tasks.
We harness an adaptive threshold, which improves network accuracy, sparsity, and robustness in streaming inference.
arXiv Detail & Related papers (2023-04-24T07:12:50Z)
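An adaptive threshold of the kind mentioned above is commonly realized by raising a neuron's firing threshold after each spike and letting it decay back toward a baseline; the rule and constants below are assumptions for illustration, not SpikingEDN's published formulation.

```python
import numpy as np

def adaptive_lif_step(v, theta, x, beta=0.9, rho=0.95,
                      theta0=1.0, alpha=0.2):
    """LIF step with an adaptive threshold (illustrative rule only).
    theta rises by alpha after a spike and decays toward theta0."""
    v = beta * v + x                          # leaky integration
    spikes = (v >= theta).astype(v.dtype)
    v = v - spikes * theta                    # reset by current threshold
    theta = rho * theta + (1.0 - rho) * theta0 + alpha * spikes
    return v, theta, spikes

v, theta = np.zeros(3), np.full(3, 1.0)
for x in np.array([[1.5, 0.4, 1.1]] * 4):
    v, theta, s = adaptive_lif_step(v, theta, x)
    print("spikes:", s, "theta:", np.round(theta, 3))
```

A rising threshold suppresses repeated firing of the same neuron, which is one way such networks gain sparsity and robustness under streaming input.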
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Complexity-Driven CNN Compression for Resource-constrained Edge AI [1.6114012813668934]
We propose a novel and computationally efficient pruning pipeline by exploiting the inherent layer-level complexities of CNNs.
We define three modes of pruning, namely parameter-aware (PA), FLOPs-aware (FA), and memory-aware (MA), to introduce versatile compression of CNNs.
arXiv Detail & Related papers (2022-08-26T16:01:23Z)
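The three modes above can be read as ranking layers by different cost metrics and pruning the most expensive layers hardest. The scoring rule and per-layer numbers below are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

# Per-layer costs for a small CNN (hypothetical numbers):
# (name, parameters, FLOPs, activation memory in bytes)
layers = [
    ("conv1", 1_728,   86_704_128,  3_211_264),
    ("conv2", 36_864,  462_422_016, 1_605_632),
    ("fc",    262_144, 524_288,     4_096),
]

def pruning_ratios(layers, mode, budget=0.5):
    """Assign larger pruning ratios to layers that dominate the chosen
    complexity metric; budget is the target average pruning ratio."""
    col = {"PA": 1, "FA": 2, "MA": 3}[mode]   # which cost column to rank by
    costs = np.array([layer[col] for layer in layers], dtype=float)
    share = costs / costs.sum()               # each layer's share of total cost
    ratios = np.clip(budget * len(layers) * share, 0.0, 0.9)
    return {layer[0]: round(float(r), 3) for layer, r in zip(layers, ratios)}

for mode in ("PA", "FA", "MA"):
    print(mode, pruning_ratios(layers, mode))
```

Running this shows how the same budget shifts toward the fully-connected layer under PA (it holds most parameters) but toward the convolutions under FA and MA.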
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
It reduces classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
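The decomposition rests on the fact that a k-bit symmetric quantized weight can be written as a weighted sum of {-1, +1} tensors, turning one quantized layer into several binary (XNOR-friendly) branches. The 2-bit construction below is a standard one, sketched as an assumption rather than the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
# Quantized weights drawn from the 2-bit symmetric grid {-3, -1, +1, +3}.
Wq = rng.choice([-3, -1, 1, 3], size=(8, 8))

# Decompose: Wq = B1 + 2*B2 with B1, B2 in {-1, +1}.
B2 = np.where(Wq > 0, 1, -1)          # sign bit
B1 = Wq - 2 * B2                      # remainder also lands in {-1, +1}
assert set(np.unique(B1).tolist()) <= {-1, 1}
assert np.array_equal(Wq, B1 + 2 * B2)

# A quantized GEMM then splits into two binary-weight branches:
X = rng.standard_normal((8, 4))
Y = B1 @ X + 2 * (B2 @ X)             # each branch is a binary GEMM
assert np.allclose(Y, Wq @ X)
```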
- Real-time Multi-Task Diffractive Deep Neural Networks via Hardware-Software Co-design [1.6066483376871004]
This work proposes a novel hardware-software co-design method that enables robust and noise-resilient Multi-task Learning in D$^2$NNs.
Our experimental results demonstrate significant improvements in versatility and hardware efficiency, and also demonstrate the robustness of the proposed multi-task D$^2$NN architecture.
arXiv Detail & Related papers (2020-12-16T12:29:54Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
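Pattern-based pruning constrains each 3x3 kernel to one of a few fixed nonzero patterns, so the resulting sparsity stays regular enough for a compiler to exploit. The small pattern library below is illustrative only; PatDNN derives its own pattern set.

```python
import numpy as np

# A small library of 3x3 patterns, each keeping 4 of 9 weights
# (illustrative; not PatDNN's actual pattern library).
PATTERNS = np.array([
    [[0, 1, 0], [1, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 1], [0, 1, 0]],
    [[1, 0, 0], [1, 1, 0], [1, 0, 0]],
    [[0, 0, 1], [0, 1, 1], [0, 0, 1]],
], dtype=float)

def apply_best_pattern(kernel):
    """Keep the pattern that preserves the most weight magnitude."""
    scores = [np.abs(kernel * p).sum() for p in PATTERNS]
    return kernel * PATTERNS[int(np.argmax(scores))]

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 3, 3, 3))        # (out_ch, in_ch, 3, 3)
W_pruned = np.stack([[apply_best_pattern(k) for k in ch] for ch in W])
print("kept fraction:", (W_pruned != 0).mean())  # about 4/9 per kernel
```

Because every kernel matches one of a few known patterns, the compiler can emit one specialized code path per pattern instead of handling arbitrary sparsity, which is how hardware efficiency is regained.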