The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal
Padding
- URL: http://arxiv.org/abs/2110.10221v1
- Date: Tue, 19 Oct 2021 19:39:04 GMT
- Title: The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal
Padding
- Authors: Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry
- Abstract summary: CoRa is a tensor compiler that allows users to easily generate efficient code for ragged tensor operators.
We evaluate CoRa on a variety of operators on ragged tensors as well as on an encoder layer of the transformer model.
- Score: 14.635810503599759
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is often variation in the shape and size of input data used for deep
learning. In many cases, such data can be represented using tensors with
non-uniform shapes, or ragged tensors. Due to limited and non-portable support
for efficient execution on ragged tensors, current deep learning frameworks
generally use techniques such as padding and masking to make the data shapes
uniform and then offload the computations to optimized kernels for dense tensor
algebra. Such techniques can, however, lead to a lot of wasted computation and
therefore, a loss in performance. This paper presents CoRa, a tensor compiler
that allows users to easily generate efficient code for ragged tensor operators
targeting a wide range of CPUs and GPUs. Evaluating CoRa on a variety of
operators on ragged tensors as well as on an encoder layer of the transformer
model, we find that CoRa (i) performs competitively with hand-optimized
implementations of the operators and the transformer encoder and (ii) achieves,
over PyTorch, a 1.6X geomean speedup for the encoder on an Nvidia GPU and a
1.86X geomean speedup for the multi-head attention module used in transformers
on an ARM CPU.
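As a back-of-the-envelope illustration of the padding overhead described above, the sketch below pads a ragged batch to the maximum sequence length and reports how much of a dense matrix multiply is spent on padding. The batch size, sequence lengths, and hidden size are assumed for illustration and are not taken from the paper.

```python
import numpy as np

# Illustrative ragged batch: per-sequence lengths are assumed values.
seq_lens = [12, 37, 64, 9, 51]
hidden = 8

# Ragged representation: one (length_i, hidden) matrix per sequence.
ragged = [np.random.rand(n, hidden) for n in seq_lens]

# Dense-framework workaround: pad every sequence to the batch maximum length.
max_len = max(seq_lens)
padded = np.zeros((len(seq_lens), max_len, hidden))
for i, x in enumerate(ragged):
    padded[i, : len(x)] = x

# A dense kernel (here a simple projection) now runs over padded positions too.
weight = np.random.rand(hidden, hidden)
out = padded @ weight  # shape: (batch, max_len, hidden)

# Fraction of the work spent on padding rather than on real tokens.
useful = sum(seq_lens)
total = len(seq_lens) * max_len
print(f"wasted computation: {1 - useful / total:.1%}")
```

Avoiding this wasted work, without resorting to padding or masking, is the kind of overhead CoRa's ragged-tensor code generation is designed to eliminate.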
Related papers
- FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers [6.194917248699324]
This paper proposes a new technique for deep learning compilers called FTuner.
Experiments show that FTuner achieves operator and end-to-end performance comparable to vendor libraries.
arXiv Detail & Related papers (2024-07-31T08:05:33Z)
- Scalable CP Decomposition for Tensor Learning using GPU Tensor Cores [47.87810316745786]
We propose a compression-based tensor decomposition framework, namely the exascale-tensor, to support exascale tensor decomposition.
Compared to the baselines, the exascale-tensor supports 8,000x larger tensors and a speedup of up to 6.95x.
We also apply our method to two real-world applications, including gene analysis and tensor layer neural networks.
arXiv Detail & Related papers (2023-11-22T21:04:59Z)
- TensorKrowch: Smooth integration of tensor networks in machine learning [46.0920431279359]
We introduce TensorKrowch, an open-source Python library built on top of PyTorch.
TensorKrowch allows users to construct any tensor network, train it, and integrate it as a layer in more intricate deep learning models.
arXiv Detail & Related papers (2023-06-14T15:55:19Z)
- Low-Rank Tensor Function Representation for Multi-Dimensional Data Recovery [52.21846313876592]
Low-rank tensor function representation (LRTFR) can continuously represent data beyond meshgrid with infinite resolution.
We develop two fundamental concepts for tensor functions, i.e., the tensor function rank and low-rank tensor function factorization.
Experiments substantiate the superiority and versatility of our method compared with state-of-the-art methods.
arXiv Detail & Related papers (2022-12-01T04:00:38Z)
- Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs [11.338285393619042]
We propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering.
With the proposed paradigm, we implement a deep learning compiler - Hidet.
arXiv Detail & Related papers (2022-10-18T05:32:13Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT). (A numerical sketch of this Toeplitz-FFT identity follows the list below.)
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking at a fraction of their entries only.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z)
- VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average a 3712x speedup with a 1301.25x energy reduction on CPU, and a 35.4x speedup with a 17.66x energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
- UNIT: Unifying Tensorized Instruction Compilation [11.193044425743981]
Hardware vendors offer tensorized instructions for mixed-precision operations, like Intel VNNI, Nvidia Tensor Core, and ARM DOT.
The lack of compilation techniques for these instructions makes them hard to utilize.
We develop a compiler framework to unify the compilation for these instructions.
arXiv Detail & Related papers (2021-01-21T06:22:58Z)
- Tensor Relational Algebra for Machine Learning System Design [7.764107702934616]
We present an alternative implementation abstraction called the tensor relational algebra (TRA).
TRA is a set-based algebra based on the relational algebra.
Our empirical study shows that the optimized TRA-based back-end can significantly outperform alternatives for running ML in distributed clusters.
arXiv Detail & Related papers (2020-09-01T15:51:24Z)
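The kernelized-attention entry above rests on the fact that a relative-positional-encoding matrix is Toeplitz (its entries depend only on i - j), so multiplying it by a vector can be done with the FFT. The sketch below is a minimal numerical check of that identity via circulant embedding; it is not the paper's attention kernel, and the sizes and values are illustrative.

```python
import numpy as np

n = 8
# A Toeplitz matrix is fixed by its first column c and first row r (with r[0] == c[0]).
c = np.random.rand(n)                 # first column
r = np.random.rand(n); r[0] = c[0]    # first row

# Dense reference: T[i, j] = c[i - j] if i >= j else r[j - i].
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])

x = np.random.rand(n)

# Embed T in a 2n x 2n circulant matrix; a circulant matvec is a circular
# convolution, which the FFT diagonalizes, giving O(n log n) instead of O(n^2).
col = np.concatenate([c, [0.0], r[1:][::-1]])   # first column of the circulant
y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(np.concatenate([x, np.zeros(n)])))[:n].real

print(np.allclose(T @ x, y))  # True: FFT result matches the dense matvec
```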