FlexSA: Flexible Systolic Array Architecture for Efficient Pruned DNN Model Training
- URL: http://arxiv.org/abs/2004.13027v1
- Date: Mon, 27 Apr 2020 15:51:20 GMT
- Title: FlexSA: Flexible Systolic Array Architecture for Efficient Pruned DNN Model Training
- Authors: Sangkug Lym, Mattan Erez
- Abstract summary: We find that pruning a model using a common training accelerator with large systolic arrays is extremely performance-inefficient.
To make a systolic array efficient for pruning and training, we propose FlexSA, a flexible systolic array architecture.
We also present a compilation heuristic for tiling matrix-multiplication-and-accumulation operations in a training workload to best utilize the resources of FlexSA.
- Score: 1.718730454558804
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Modern deep learning models have high memory and computation cost. To make
them fast and memory-cost efficient, structured model pruning is commonly used.
We find that pruning a model using a common training accelerator with large
systolic arrays is extremely performance-inefficient. To make a systolic array
efficient for pruning and training, we propose FlexSA, a flexible systolic
array architecture. FlexSA dynamically reconfigures the systolic array
structure and offers multiple sub-systolic operating modes, which are designed
for energy- and memory bandwidth-efficient processing of tensors with different
sizes and shapes. We also present a compilation heuristic for tiling
matrix-multiplication-and-accumulation operations in a training workload to
best utilize the resources of FlexSA. Based on our evaluation, FlexSA with the
proposed compilation heuristic improves compute resource utilization of pruning
and training modern CNN models by 37% compared to a conventional training
accelerator with a large systolic array. FlexSA also improves on-chip data
reuse by 1.7X, saving 28% energy compared to naive systolic array splitting.
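The paper does not include code, but the mode-selection idea can be illustrated with a short sketch. The Python snippet below assumes a hypothetical 128x128 PE array that either runs whole or splits into 64x64 sub-arrays; the mode names, sizes, and PE-utilization model are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of a FlexSA-style mode choice (not the authors' code).
# Assumption: a 128x128 PE array that either runs whole ("full") or as four
# independent 64x64 sub-arrays ("split"); names and sizes are hypothetical.

from dataclasses import dataclass

@dataclass
class Gemm:
    m: int  # output rows
    n: int  # output columns
    k: int  # reduction depth

FULL, SPLIT = 128, 64  # PE-array edge lengths of the two operating modes

def utilization(g: Gemm, edge: int) -> float:
    """Fraction of provisioned PEs doing useful work when g is tiled
    onto edge x edge arrays (a deliberately simple utilization model)."""
    tiles_m = -(-g.m // edge)  # ceiling division
    tiles_n = -(-g.n // edge)
    return (g.m * g.n) / (tiles_m * tiles_n * edge * edge)

def choose_mode(g: Gemm) -> str:
    """Pick the operating mode that wastes the fewest PEs for this GEMM.
    Pruning shrinks m, n, and k, so pruned layers often map better onto
    the smaller sub-arrays than onto one large array."""
    return "full" if utilization(g, FULL) >= utilization(g, SPLIT) else "split"

if __name__ == "__main__":
    dense_layer = Gemm(m=512, n=512, k=1024)  # large GEMM: full array is fine
    pruned_layer = Gemm(m=60, n=60, k=256)    # shrunk by pruning: split wins
    for g in (dense_layer, pruned_layer):
        print(g, "->", choose_mode(g),
              f"(full {utilization(g, FULL):.2f}, split {utilization(g, SPLIT):.2f})")
```

Because structured pruning shrinks GEMM dimensions as training proceeds, a shape-aware mode choice like this recovers the utilization that a fixed large array loses on small or skinny operands.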
Related papers
- Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z)
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense layers for the same compute on multiple tasks (a toy block-diagonal sketch follows this list).
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers to reduce the model size (a simplified factorization sketch follows this list).
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z)
- ArrayFlex: A Systolic Array Architecture with Configurable Transparent Pipelining [0.0]
Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many deep learning applications.
In this work, we focus on the design of a systolic array with a configurable, transparent pipeline.
We show that ArrayFlex reduces the latency of state-of-the-art CNNs by 11%, on average, as compared to a traditional fixed-pipeline systolic array.
arXiv Detail & Related papers (2022-11-22T21:56:38Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval [60.457378374671656]
Tevatron is a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity.
We show how Tevatron's flexible design enables easy generalization across datasets, model architectures, and accelerator platforms.
arXiv Detail & Related papers (2022-03-11T05:47:45Z)
- Memory-efficient array redistribution through portable collective communication [0.4096453902709291]
We present a type-directed approach to synthesizing array redistributions as sequences of MPI-style collective operations.
We prove formally that our synthesized redistributions are memory-efficient and perform no excessive data transfers.
We evaluate our approach against the XLA implementation and find that our approach delivers a geometric mean speedup of $1.22\times$, with maximum speedups as high as $5.7\times$.
arXiv Detail & Related papers (2021-12-02T09:32:07Z)
- Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps).
We refer to this algorithm as Dynamic Probabilistic Pruning (DPP).
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification (a toy mask-sampling sketch follows this list).
arXiv Detail & Related papers (2021-05-26T17:01:52Z)
- High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models [18.63017668881868]
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook.
In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.
We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40X speedup in terms of time to solution over previous systems.
arXiv Detail & Related papers (2021-04-12T02:15:55Z)
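For the structured-matrices entry above ("Compute Better Spent"), here is a toy Python sketch that uses a plain block-diagonal matrix as a simple stand-in for the structured families the paper studies (Monarch, Block-Train); the sizes and the einsum-based implementation are illustrative assumptions.

```python
# Toy structured-matrix stand-in: a block-diagonal layer. This is NOT the
# paper's Block-Train/Monarch construction, just the generic idea that
# structure trades expressivity for parameters and FLOPs.

import numpy as np

def block_diag_matvec(blocks: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Apply a block-diagonal matrix stored as (num_blocks, b, b) to x."""
    num_blocks, b, _ = blocks.shape
    xs = x.reshape(num_blocks, b)              # split input across blocks
    ys = np.einsum("nij,nj->ni", blocks, xs)   # one small matmul per block
    return ys.reshape(num_blocks * b)

rng = np.random.default_rng(0)
d, num_blocks = 1024, 16
b = d // num_blocks

dense = rng.normal(size=(d, d))                 # 1,048,576 parameters
blocks = rng.normal(size=(num_blocks, b, b))    # 16 * 64 * 64 = 65,536 parameters
x = rng.normal(size=d)

y = block_diag_matvec(blocks, x)
print("dense params:", dense.size, "block-diag params:", blocks.size)
print("param ratio:", dense.size // blocks.size)  # 16x fewer params and FLOPs
```

The point it makes is only the generic one: a structured matrix of the same width uses far fewer parameters and FLOPs than a dense one, which is the budget the paper then respends.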
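For the MPO parameter-sharing entry, the sketch below approximates the decomposition with a truncated SVD, splitting a weight matrix into a large "central" factor (shareable across layers) and a small per-layer "auxiliary" factor. A true matrix product operator decomposition first reshapes the matrix into a higher-order tensor; that step is omitted here as a simplifying assumption.

```python
# Rough truncated-SVD stand-in for an MPO-style two-part factorization
# (a simplification; a real MPO decomposition works on a reshaped,
# higher-order tensor rather than directly on the matrix).

import numpy as np

def split_central_auxiliary(w: np.ndarray, rank: int):
    """Split w into a large factor (candidate for cross-layer sharing)
    and a small per-layer factor via truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    central = u[:, :rank] * s[:rank]  # biggest factor: shared across layers
    auxiliary = vt[:rank, :]          # small factor: kept per layer
    return central, auxiliary

rng = np.random.default_rng(0)
w = rng.normal(size=(768, 768))  # stand-in for one layer's weight matrix
central, aux = split_central_auxiliary(w, rank=64)
print("params:", w.size, "->", central.size + aux.size)  # 589824 -> 98304
# Sharing `central` across N layers amortizes its cost; each layer then
# only stores its own small `auxiliary` factor.
```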
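For the Dynamic Probabilistic Pruning entry, this sketch samples a filter-level keep-mask from norm-derived probabilities. DPP itself learns the sampling distribution during training; the norm-based scores and the keep_ratio parameter here are hypothetical stand-ins.

```python
# Toy sketch of probabilistic, granularity-aware pruning in the spirit of
# DPP: draw a filter-level keep-mask from per-filter probabilities. The
# real algorithm learns its sampling distribution; norms are a stand-in.

import numpy as np

def filter_prune_mask(weights: np.ndarray, keep_ratio: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Sample a binary keep-mask over the output filters of a conv
    weight tensor shaped (out_ch, in_ch, kh, kw)."""
    scores = np.linalg.norm(weights.reshape(weights.shape[0], -1), axis=1)
    probs = scores / scores.sum()          # stronger filters kept more often
    n_keep = max(1, int(keep_ratio * weights.shape[0]))
    kept = rng.choice(weights.shape[0], size=n_keep, replace=False, p=probs)
    mask = np.zeros(weights.shape[0], dtype=bool)
    mask[kept] = True
    return mask

rng = np.random.default_rng(0)
conv_w = rng.normal(size=(64, 32, 3, 3))   # one conv layer's weights
mask = filter_prune_mask(conv_w, keep_ratio=0.5, rng=rng)
pruned = conv_w[mask]                      # structured: whole filters drop
print("kept", mask.sum(), "of", conv_w.shape[0], "filters ->", pruned.shape)
```

Sampling at the filter level keeps the pruned model hardware-friendly, since entire output channels disappear rather than scattered individual weights.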
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.