SparseTIR: Composable Abstractions for Sparse Compilation in Deep
Learning
- URL: http://arxiv.org/abs/2207.04606v1
- Date: Mon, 11 Jul 2022 03:49:53 GMT
- Title: SparseTIR: Composable Abstractions for Sparse Compilation in Deep
Learning
- Authors: Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, Luis Ceze
- Abstract summary: Sparse tensor compilers simplify the development of operators, but efficient sparse compilation for deep learning remains challenging.
We show that the key to addressing both challenges is two forms of composability.
In this paper, we propose SparseTIR, a sparse tensor compilation abstraction that offers composable formats and composable transformations.
- Score: 11.251022748134215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse tensors are rapidly becoming critical components of modern deep
learning workloads. However, developing high-performance sparse operators can
be difficult and tedious, and existing vendor libraries cannot satisfy the
escalating demands from new operators. Sparse tensor compilers simplify the
development of operators, but efficient sparse compilation for deep learning
remains challenging because a single sparse format cannot maximize hardware
efficiency, and single-shot compilers cannot keep up with latest hardware and
system advances. We show that the key to addressing both challenges is two
forms of composability. In this paper, we propose SparseTIR, a sparse tensor
compilation abstraction that offers composable formats and composable
transformations for deep learning workloads. SparseTIR constructs a search
space over these composable components for performance tuning. With these
improvements, SparseTIR obtains consistent performance speedups vs vendor
libraries on GPUs for single operators: 1.1-3.3x for GNN operators and 1.1-4.4x
for sparse transformer operators. SparseTIR also accelerates end-to-end GNNs by
1.1-2.2x for GraphSAGE training and 0.9-26x for RGCN inference.
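To make the "composable formats" idea concrete, the following is a minimal NumPy sketch, not SparseTIR code or its API: it splits a CSR matrix into a regular ELL portion plus a COO remainder and computes SpMM as the sum of the two partial products, which is the kind of format decomposition the abstract describes. The helper names (`split_ell_coo`, `spmm_composed`) and the `ell_width` threshold are illustrative assumptions.

```python
# Minimal NumPy sketch (not SparseTIR's API) of composable sparse formats:
# split a CSR matrix into an ELL part (first `ell_width` nonzeros per row,
# zero-padded) plus a COO remainder, then compute Y = A @ X as the sum of
# the two partial SpMMs.
import numpy as np

def split_ell_coo(indptr, indices, data, ell_width):
    """Split a CSR matrix into ELL (padded) and COO (overflow) pieces."""
    n_rows = len(indptr) - 1
    ell_cols = np.zeros((n_rows, ell_width), dtype=np.int64)
    ell_vals = np.zeros((n_rows, ell_width), dtype=data.dtype)
    coo_rows, coo_cols, coo_vals = [], [], []
    for i in range(n_rows):
        start, end = indptr[i], indptr[i + 1]
        keep = min(end - start, ell_width)
        ell_cols[i, :keep] = indices[start:start + keep]
        ell_vals[i, :keep] = data[start:start + keep]
        for j in range(start + keep, end):  # overflow nonzeros go to COO
            coo_rows.append(i)
            coo_cols.append(indices[j])
            coo_vals.append(data[j])
    return (ell_cols, ell_vals), (np.array(coo_rows, dtype=np.int64),
                                  np.array(coo_cols, dtype=np.int64),
                                  np.array(coo_vals, dtype=data.dtype))

def spmm_composed(indptr, indices, data, X, ell_width=4):
    """Y = A @ X computed as ELL contribution + COO contribution."""
    (ell_cols, ell_vals), (rows, cols, vals) = split_ell_coo(
        indptr, indices, data, ell_width)
    # ELL part: regular shape, so it vectorizes / maps well to GPU warps.
    Y = np.einsum("rk,rkf->rf", ell_vals, X[ell_cols])
    # COO part: irregular remainder handled separately via scatter-add.
    np.add.at(Y, rows, vals[:, None] * X[cols])
    return Y

# Tiny CSR example: 3x3 matrix with rows of varying nonzero counts.
indptr  = np.array([0, 1, 3, 6])
indices = np.array([0, 0, 2, 0, 1, 2])
data    = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.random.randn(3, 8)
Y = spmm_composed(indptr, indices, data, X, ell_width=2)
A = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 3.0], [4.0, 5.0, 6.0]])
assert np.allclose(Y, A @ X)
```

In SparseTIR itself, per the abstract above, the decomposed pieces are then scheduled with composable transformations, and the choice of decomposition and schedule forms the search space used for performance tuning.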
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs [20.4238781638402]
Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems.
Existing GPU libraries offer two dataflow types for sparse convolution.
We introduce TorchSparse++, a new GPU library that achieves the best of both worlds.
arXiv Detail & Related papers (2023-10-25T21:02:38Z)
- Adapting Language Models to Compress Contexts [71.98287002918941]
Transformer-based language models (LMs) are powerful and widely-applicable tools, but their usefulness is constrained by a finite context window.
We propose to adapt pre-trained LMs into AutoCompressors, which are capable of compressing long contexts into compact summary vectors.
We fine-tune OPT and Llama-2 models on sequences of up to 30,720 tokens and show that AutoCompressors can utilize long contexts to improve perplexity.
arXiv Detail & Related papers (2023-05-24T06:42:44Z)
- Powerful and Extensible WFST Framework for RNN-Transducer Losses [71.56212119508551]
This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications for RNN-Transducer (RNN-T) loss.
Existing implementations of RNN-T use CUDA-related code, which is hard to extend and debug.
We introduce two WFST-powered RNN-T implementations: "Compose-Transducer" and "Grid-Transducer".
arXiv Detail & Related papers (2023-03-18T10:36:33Z)
- Boosting Neural Networks to Decompile Optimized Binaries [13.255618541522436]
Decompilation aims to transform a low-level programming language (LPL) into its functionally-equivalent high-level programming language (HPL).
We propose a novel learning-based approach named NeurDP, that targets compiler-optimized binaries.
arXiv Detail & Related papers (2023-01-03T06:45:54Z)
- Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs [11.338285393619042]
We propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering.
With the proposed paradigm, we implement a deep learning compiler - Hidet.
arXiv Detail & Related papers (2022-10-18T05:32:13Z)
- The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding [14.635810503599759]
CoRa is a tensor compiler that allows users to easily generate efficient code for ragged tensor operators.
We evaluate CoRa on a variety of operators on ragged tensors as well as on an encoder layer of the transformer model.
arXiv Detail & Related papers (2021-10-19T19:39:04Z)
- Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads [86.62083829086393]
This work introduces the Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of deep learning workloads with high productivity.
TPPs define a compact, yet versatile set of 2D-tensor operators (or a virtual ISA), which can be utilized as building-blocks to construct complex operators on high-dimensional tensors.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end DL-workloads expressed entirely via TPPs that outperform state-of-the-art implementations on multiple platforms.
arXiv Detail & Related papers (2021-04-12T18:35:49Z)
- Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments.
In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch; a minimal illustration of N:M pruning appears after this list.
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
- UNIT: Unifying Tensorized Instruction Compilation [11.193044425743981]
Hardware vendors offer tensorized instructions for mixed-precision operations, like Intel VNNI, Tensor Core, and ARM-DOT.
The lack of compilation techniques for these instructions makes them hard to utilize.
We develop a compiler framework to unify the compilation for these instructions.
arXiv Detail & Related papers (2021-01-21T06:22:58Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
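As referenced in the N:M entry above, here is a minimal NumPy illustration (not the paper's training method) of fine-grained structured sparsity for the common 2:4 pattern: within every group of four consecutive weights, the two largest-magnitude entries are kept and the rest are zeroed. The helper name `nm_prune` is an illustrative assumption.

```python
# Minimal NumPy illustration of N:M fine-grained structured sparsity (here 2:4):
# within every group of M=4 consecutive weights along the last axis, keep the
# N=2 largest-magnitude entries and zero the rest. This shows only the pruning
# mask, not the from-scratch training recipe studied in the paper above.
import numpy as np

def nm_prune(weights, n=2, m=4):
    """Return weights masked to the N:M structured-sparsity pattern."""
    assert weights.shape[-1] % m == 0, "last dim must be a multiple of M"
    groups = weights.reshape(-1, m)               # one row per group of M weights
    # indices of the (M - N) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)  # zero out the smallest entries
    return (groups * mask).reshape(weights.shape)

W = np.random.randn(8, 16).astype(np.float32)
W_24 = nm_prune(W)                                # every 4 weights, keep 2
assert (np.count_nonzero(W_24.reshape(-1, 4), axis=1) <= 2).all()
```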