Related papers: FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers

FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers

URL: http://arxiv.org/abs/2407.21418v1
Date: Wed, 31 Jul 2024 08:05:33 GMT
Title: FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers
Authors: Pengyu Mu, Linquan Wei, Yi Liu, Rui Wang,
Abstract summary: This paper proposes a new technique for deep learning compilers called FTuner. Experiments show that the FTuner can achieve comparable operators and end-to-end performance to vendor libraries.
Score: 6.194917248699324
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Many artificial intelligence models process input data of different lengths and resolutions, making the shape of the tensors dynamic. The performance of these models depends on the shape of the tensors, which makes it difficult to optimize the tensors before the model runs. There are two common solutions to this problem. The first is to add useless data to the input to match a pre-optimized tensor library. The second is to use small basic tensors to create a tensor that is closest in size to the input data and then tune it to minimize padding. However, this second solution can be time-consuming. This paper proposes a new technique for deep learning compilers called FTuner. Instead of using a large design space or training a cost model, we use an abstract computational unit called the uKernel to patch together small, various-sized tensors to match the shape of the input tensor. We determine the shape of the uKernel using an analytic hardware information model. Experiments show that the FTuner can achieve comparable operators and end-to-end performance to vendor libraries and achieves 3\% speedup on existing auto-tuner with the model-training compiler while reducing tuning time by two orders of magnitude.

Related papers

Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints. We empirically find that this training paradigm limits the one-step generation performance of consistency models. We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance [65.48009829137824]
Large-scale 3D parallel training on vision-language instruct-tuning models leads to an imbalanced computation load across different devices. We rebalanced the computational loads from data, model, and memory perspectives to address this issue. Our method's efficacy and generalizability were further demonstrated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z)
MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints. We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B. We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs [11.338285393619042]
We propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering. With the proposed paradigm, we implement a deep learning compiler - Hidet.
arXiv Detail & Related papers (2022-10-18T05:32:13Z)
Near-Linear Time and Fixed-Parameter Tractable Algorithms for Tensor Decompositions [51.19236668224547]
We study low rank approximation of tensors, focusing on the tensor train and Tucker decompositions. For tensor train decomposition, we give a bicriteria $(1 + eps)$-approximation algorithm with a small bicriteria rank and $O(q cdot nnz(A))$ running time. In addition, we extend our algorithm to tensor networks with arbitrary graphs.
arXiv Detail & Related papers (2022-07-15T11:55:09Z)
DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation [29.804356645683463]
We propose a novel scheduler named DELTA for tensor swapping and tensor recomputation. We show that DELTA not only saves 40%-70% of GPU memory, surpassing the state-of-the-art method to a great extent.
arXiv Detail & Related papers (2022-03-30T01:40:25Z)
The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding [14.635810503599759]
CoRa is a tensor compiler that allows users to easily generate efficient code for ragged tensor operators. We evaluate CoRa on a variety of operators on ragged tensors as well as on an encoder layer of the transformer model.
arXiv Detail & Related papers (2021-10-19T19:39:04Z)
Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking emphat a fraction of their entries only. The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z)
Multi-version Tensor Completion for Time-delayed Spatio-temporal Data [50.762087239885936]
Real-world-temporal data is often incomplete or inaccurate due to various data loading delays. We propose a low-rank tensor model to predict the updates over time. We obtain up to 27.2% lower root mean-squared-error compared to the best baseline method.
arXiv Detail & Related papers (2021-05-11T19:55:56Z)
Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination [11.173092834726528]
It is desirable to directly train a compact neural network from scratch with low memory and low computational cost. Low-rank tensor decomposition is one of the most effective approaches to reduce the memory and computing requirements of large-size neural networks. This paper presents a novel end-to-end framework for low-rank tensorized training of neural networks.
arXiv Detail & Related papers (2020-10-17T01:23:26Z)
Streaming Coresets for Symmetric Tensor Factorization [9.181791777532608]
We show how to factorize tensors efficiently in the streaming setting. We introduce two novel algorithmic techniques: online filtering and kernelization. We show applications of our algorithms in learning single topic modeling.
arXiv Detail & Related papers (2020-06-01T19:55:34Z)
Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence. This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time. Our results achieve state-of-the-art performance-art in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.