FlexSA: Flexible Systolic Array Architecture for Efficient Pruned DNN Model Training
- URL: http://arxiv.org/abs/2004.13027v1
- Date: Mon, 27 Apr 2020 15:51:20 GMT
- Title: FlexSA: Flexible Systolic Array Architecture for Efficient Pruned DNN Model Training
- Authors: Sangkug Lym, Mattan Erez
- Abstract summary: We find that pruning a model using a common training accelerator with large systolic arrays is extremely performance-inefficient.
To make a systolic array efficient for pruning and training, we propose FlexSA, a flexible systolic array architecture.
We also present a compilation heuristic for tiling matrix-multiplication-and-accumulation operations in a training workload to best utilize the resources of FlexSA.
- Score: 1.718730454558804
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Modern deep learning models have high memory and computation cost. To make
them fast and memory-cost efficient, structured model pruning is commonly used.
We find that pruning a model using a common training accelerator with large
systolic arrays is extremely performance-inefficient. To make a systolic array
efficient for pruning and training, we propose FlexSA, a flexible systolic
array architecture. FlexSA dynamically reconfigures the systolic array
structure and offers multiple sub-systolic operating modes, which are designed
for energy- and memory bandwidth-efficient processing of tensors with different
sizes and shapes. We also present a compilation heuristic for tiling
matrix-multiplication-and-accumulation operations in a training workload to
best utilize the resources of FlexSA. Based on our evaluation, FlexSA with the
proposed compilation heuristic improves compute resource utilization of pruning
and training modern CNN models by 37% compared to a conventional training
accelerator with a large systolic array. FlexSA also improves on-chip data
reuse by 1.7X, saving 28% energy compared to naive systolic array splitting.
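The paper does not include code, but the mode-selection idea can be illustrated with a short sketch. The Python snippet below assumes a hypothetical 128x128 PE array that either runs whole or splits into 64x64 sub-arrays; the mode names, sizes, and PE-utilization model are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of a FlexSA-style mode choice (not the authors' code).
# Assumption: a 128x128 PE array that either runs whole ("full") or as four
# independent 64x64 sub-arrays ("split"); names and sizes are hypothetical.

from dataclasses import dataclass

@dataclass
class Gemm:
    m: int  # output rows
    n: int  # output columns
    k: int  # reduction depth

FULL, SPLIT = 128, 64  # PE-array edge lengths of the two operating modes

def utilization(g: Gemm, edge: int) -> float:
    """Fraction of provisioned PEs doing useful work when g is tiled
    onto edge x edge arrays (a deliberately simple utilization model)."""
    tiles_m = -(-g.m // edge)  # ceiling division
    tiles_n = -(-g.n // edge)
    return (g.m * g.n) / (tiles_m * tiles_n * edge * edge)

def choose_mode(g: Gemm) -> str:
    """Pick the operating mode that wastes the fewest PEs for this GEMM.
    Pruning shrinks m, n, and k, so pruned layers often map better onto
    the smaller sub-arrays than onto one large array."""
    return "full" if utilization(g, FULL) >= utilization(g, SPLIT) else "split"

if __name__ == "__main__":
    dense_layer = Gemm(m=512, n=512, k=1024)  # large GEMM: full array is fine
    pruned_layer = Gemm(m=60, n=60, k=256)    # shrunk by pruning: split wins
    for g in (dense_layer, pruned_layer):
        print(g, "->", choose_mode(g),
              f"(full {utilization(g, FULL):.2f}, split {utilization(g, SPLIT):.2f})")
```

Because structured pruning shrinks GEMM dimensions as training proceeds, a shape-aware mode choice like this recovers the utilization that a fixed large array loses on small or skinny operands.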
Related papers
- Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z)
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense layers for the same compute on multiple tasks (a toy block-diagonal sketch follows this list).
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers to reduce the model size (a simplified factorization sketch follows this list).
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z)
- ArrayFlex: A Systolic Array Architecture with Configurable Transparent Pipelining [0.0]
Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many deep learning applications.
In this work, we focus on the design of a systolic array with a configurable, transparent pipeline.
We show that ArrayFlex reduces the latency of state-of-the-art CNNs by 11%, on average, as compared to a traditional fixed-pipeline systolic array.
arXiv Detail & Related papers (2022-11-22T21:56:38Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval [60.457378374671656]
Tevatron is a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity.
We show how Tevatron's flexible design enables easy generalization across datasets, model architectures, and accelerator platforms.
arXiv Detail & Related papers (2022-03-11T05:47:45Z)
- Memory-efficient array redistribution through portable collective communication [0.4096453902709291]
We present a type-directed approach to synthesizing array redistributions as sequences of MPI-style collective operations.
We prove formally that our synthesized redistributions are memory-efficient and perform no excessive data transfers.
We evaluate our approach against the XLA implementation and find that our approach delivers a geometric mean speedup of $1.22\times$, with maximum speedups as high as $5.7\times$.
arXiv Detail & Related papers (2021-12-02T09:32:07Z)
- Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps).
We refer to this algorithm as Dynamic Probabilistic Pruning (DPP).
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification (a toy mask-sampling sketch follows this list).
arXiv Detail & Related papers (2021-05-26T17:01:52Z)
- High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models [18.63017668881868]
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook.
In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.
We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40X speedup in terms of time to solution over previous systems.
arXiv Detail & Related papers (2021-04-12T02:15:55Z)
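For the structured-matrices entry above ("Compute Better Spent"), here is a toy Python sketch that uses a plain block-diagonal matrix as a simple stand-in for the structured families the paper studies (Monarch, Block-Train); the sizes and the einsum-based implementation are illustrative assumptions.

```python
# Toy structured-matrix stand-in: a block-diagonal layer. This is NOT the
# paper's Block-Train/Monarch construction, just the generic idea that
# structure trades expressivity for parameters and FLOPs.

import numpy as np

def block_diag_matvec(blocks: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Apply a block-diagonal matrix stored as (num_blocks, b, b) to x."""
    num_blocks, b, _ = blocks.shape
    xs = x.reshape(num_blocks, b)              # split input across blocks
    ys = np.einsum("nij,nj->ni", blocks, xs)   # one small matmul per block
    return ys.reshape(num_blocks * b)

rng = np.random.default_rng(0)
d, num_blocks = 1024, 16
b = d // num_blocks

dense = rng.normal(size=(d, d))                 # 1,048,576 parameters
blocks = rng.normal(size=(num_blocks, b, b))    # 16 * 64 * 64 = 65,536 parameters
x = rng.normal(size=d)

y = block_diag_matvec(blocks, x)
print("dense params:", dense.size, "block-diag params:", blocks.size)
print("param ratio:", dense.size // blocks.size)  # 16x fewer params and FLOPs
```

The point it makes is only the generic one: a structured matrix of the same width uses far fewer parameters and FLOPs than a dense one, which is the budget the paper then respends.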
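For the MPO parameter-sharing entry, the sketch below approximates the decomposition with a truncated SVD, splitting a weight matrix into a large "central" factor (shareable across layers) and a small per-layer "auxiliary" factor. A true matrix product operator decomposition first reshapes the matrix into a higher-order tensor; that step is omitted here as a simplifying assumption.

```python
# Rough truncated-SVD stand-in for an MPO-style two-part factorization
# (a simplification; a real MPO decomposition works on a reshaped,
# higher-order tensor rather than directly on the matrix).

import numpy as np

def split_central_auxiliary(w: np.ndarray, rank: int):
    """Split w into a large factor (candidate for cross-layer sharing)
    and a small per-layer factor via truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    central = u[:, :rank] * s[:rank]  # biggest factor: shared across layers
    auxiliary = vt[:rank, :]          # small factor: kept per layer
    return central, auxiliary

rng = np.random.default_rng(0)
w = rng.normal(size=(768, 768))  # stand-in for one layer's weight matrix
central, aux = split_central_auxiliary(w, rank=64)
print("params:", w.size, "->", central.size + aux.size)  # 589824 -> 98304
# Sharing `central` across N layers amortizes its cost; each layer then
# only stores its own small `auxiliary` factor.
```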
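For the Dynamic Probabilistic Pruning entry, this sketch samples a filter-level keep-mask from norm-derived probabilities. DPP itself learns the sampling distribution during training; the norm-based scores and the keep_ratio parameter here are hypothetical stand-ins.

```python
# Toy sketch of probabilistic, granularity-aware pruning in the spirit of
# DPP: draw a filter-level keep-mask from per-filter probabilities. The
# real algorithm learns its sampling distribution; norms are a stand-in.

import numpy as np

def filter_prune_mask(weights: np.ndarray, keep_ratio: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Sample a binary keep-mask over the output filters of a conv
    weight tensor shaped (out_ch, in_ch, kh, kw)."""
    scores = np.linalg.norm(weights.reshape(weights.shape[0], -1), axis=1)
    probs = scores / scores.sum()          # stronger filters kept more often
    n_keep = max(1, int(keep_ratio * weights.shape[0]))
    kept = rng.choice(weights.shape[0], size=n_keep, replace=False, p=probs)
    mask = np.zeros(weights.shape[0], dtype=bool)
    mask[kept] = True
    return mask

rng = np.random.default_rng(0)
conv_w = rng.normal(size=(64, 32, 3, 3))   # one conv layer's weights
mask = filter_prune_mask(conv_w, keep_ratio=0.5, rng=rng)
pruned = conv_w[mask]                      # structured: whole filters drop
print("kept", mask.sum(), "of", conv_w.shape[0], "filters ->", pruned.shape)
```

Sampling at the filter level keeps the pruned model hardware-friendly, since entire output channels disappear rather than scattered individual weights.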
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.