TLP: A Deep Learning-based Cost Model for Tensor Program Tuning
- URL: http://arxiv.org/abs/2211.03578v1
- Date: Mon, 7 Nov 2022 14:11:43 GMT
- Title: TLP: A Deep Learning-based Cost Model for Tensor Program Tuning
- Authors: Yi Zhai, Yu Zhang, Shuo Liu, Xiaomeng Chu, Jie Peng, Jianmin Ji,
Yanyong Zhang
- Abstract summary: We propose TLP, a deep learning-based cost model that facilitates tensor program tuning.
We show that TLP can speed up the average search time by 9.1X on CPU workloads.
We incorporate these techniques into the Ansor framework and conduct detailed experiments.
- Score: 15.841139749937351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tensor program tuning is a non-convex objective optimization problem, to
which search-based approaches have proven to be effective. At the core of the
search-based approaches lies the design of the cost model. Though deep
learning-based cost models perform significantly better than other methods,
they still fall short and suffer from the following problems. First, their
feature extraction heavily relies on expert-level domain knowledge in hardware
architectures. Even so, the extracted features are often unsatisfactory and
require separate considerations for CPUs and GPUs. Second, a cost model trained
on one hardware platform usually performs poorly on another, a problem we call
cross-hardware unavailability.
In order to address these problems, we propose TLP and MTL-TLP. TLP is a deep
learning-based cost model that facilitates tensor program tuning. Instead of
extracting features from the tensor program itself, TLP extracts features from
the schedule primitives. We treat schedule primitives as tensor languages. TLP
is thus a Tensor Language Processing task. In this way, the task of predicting
the tensor program latency through the cost model is transformed into a natural
language processing (NLP) regression task. MTL-TLP combines Multi-Task Learning
and TLP to cope with the cross-hardware unavailability problem.
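As a rough illustration of this idea (a minimal sketch, not the authors' released code), the schedule primitives of a candidate tensor program can be tokenized into a sequence and regressed to a latency value with a standard sequence encoder; for MTL-TLP, a shared encoder feeds one regression head per hardware platform. The vocabulary size, model dimensions, and per-hardware head layout below are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the TLP/MTL-TLP idea described above (assumptions, not the
# authors' code): schedule primitives are treated as a token sequence and
# latency prediction becomes a sequence-regression task.
import torch
import torch.nn as nn


class TLPRegressor(nn.Module):
    def __init__(self, vocab_size=512, d_model=256, n_heads=8,
                 n_layers=4, hardware_targets=("cpu", "gpu")):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # MTL-TLP: one shared encoder, one regression head per hardware target,
        # so abundant data from one platform can help a data-scarce platform.
        self.heads = nn.ModuleDict(
            {hw: nn.Linear(d_model, 1) for hw in hardware_targets})

    def forward(self, primitive_token_ids, hardware="cpu"):
        # primitive_token_ids: (batch, seq_len) integer ids of schedule primitives
        x = self.embed(primitive_token_ids)
        x = self.encoder(x)        # contextualize the primitive sequence
        x = x.mean(dim=1)          # pool over the sequence
        return self.heads[hardware](x).squeeze(-1)  # predicted latency score


# Example: score a batch of 8 candidate schedules, each encoded as 64 tokens.
model = TLPRegressor()
tokens = torch.randint(0, 512, (8, 64))
predicted_latency = model(tokens, hardware="gpu")
```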
We incorporate these techniques into the Ansor framework and conduct detailed
experiments. Results show that TLP can speed up the average search time by 9.1X
and 3.0X on CPU and GPU workloads, respectively, compared to the
state-of-the-art implementation. MTL-TLP can achieve a speed-up of 4.7X and
2.9X on CPU and GPU workloads, respectively, using only 7% of the target
hardware data.
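For context, a hedged sketch of how such a cost model typically drives a search-based tuner like Ansor: the model cheaply ranks many candidate schedules, and only the most promising few are measured on real hardware and fed back as training data. The helpers generate_candidates, encode_primitives, and measure_on_hardware are hypothetical stand-ins for the tuner's internals, not Ansor APIs.

```python
# Hedged sketch of cost-model-guided search (hypothetical helpers, not Ansor's API).
import torch


def search_step(model, task, n_candidates=1000, top_k=10, hardware="cpu"):
    candidates = generate_candidates(task, n_candidates)   # sample/mutate schedules
    tokens = torch.stack([encode_primitives(c) for c in candidates])  # fixed-length token ids
    with torch.no_grad():
        scores = model(tokens, hardware=hardware)           # predicted latency per candidate
    best = scores.argsort()[:top_k].tolist()                # lowest predicted latency first
    # Only the top-k candidates are measured on real hardware; the measurements
    # can then be used to refine the cost model.
    return [(candidates[i], measure_on_hardware(candidates[i])) for i in best]
```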
Related papers
- QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach [25.521351239401287]
Heterogeneous deep learning systems (DLS) have been widely deployed in industrial data centers.
We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating programs across DLS.
As a result, the programming of DLS is improved by up to 9x via transcompiling legacy programs.
arXiv Detail & Related papers (2025-05-04T15:14:27Z) - Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference.
This paper presents the first comprehensive study on the prevalent issue of overthinking in these models.
We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z) - Parameter-Efficient Transfer Learning for Music Foundation Models [51.61531917413708]
We investigate the use of parameter-efficient transfer learning (PETL) for music foundation models.
PETL methods outperform both probing and fine-tuning on music auto-tagging.
PETL methods achieve similar results as fine-tuning with significantly less training cost.
arXiv Detail & Related papers (2024-11-28T20:50:40Z) - FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers [6.194917248699324]
This paper proposes a new technique for deep learning compilers called FTuner.
Experiments show that FTuner can achieve operator and end-to-end performance comparable to vendor libraries.
arXiv Detail & Related papers (2024-07-31T08:05:33Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Pruner: A Speculative Exploration Mechanism to Accelerate Tensor Program Tuning [9.730351520714699]
Pruner and MoA-Pruner are proposed to speed up program tuning for deep neural networks.
Pruner is a speculative exploration mechanism that accelerates the search process using a "Draft-then-Verify" paradigm.
MoA-Pruner introduces Momentum online Adaptation to address the cross-platform online unawareness.
arXiv Detail & Related papers (2024-02-04T06:11:12Z) - Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z) - Improving Representational Continuity via Continued Pretraining [76.29171039601948]
A method from the transfer learning community (LP-FT) outperforms naive training and other continual learning methods.
LP-FT also reduces forgetting in a real world satellite remote sensing dataset (FMoW)
variant of LP-FT gets state-of-the-art accuracies on an NLP continual learning benchmark.
arXiv Detail & Related papers (2023-02-26T10:39:38Z) - Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which in contrast optimizes task-specific decoder networks on the output side.
By gradient-based optimization, DecT can be trained within several seconds and requires only one PLM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z) - PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PaL) to understand natural language problems.
PaL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z) - Learning to Optimize Permutation Flow Shop Scheduling via Graph-based
Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z) - Compressing And Debiasing Vision-Language Pre-Trained Models for Visual
Question Answering [25.540831728925557]
This paper investigates whether a vision-language pre-trained model can be compressed and debiased simultaneously by searching for sparse and robust subnetworks.
Our results show that there indeed exist sparse and robust subnetworks, which are competitive with the debiased full model.
arXiv Detail & Related papers (2022-10-26T08:25:03Z) - Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor
Programs [11.338285393619042]
We propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering.
With the proposed paradigm, we implement a deep learning compiler - Hidet.
arXiv Detail & Related papers (2022-10-18T05:32:13Z) - Design and Implementation of a Quantum Kernel for Natural Language
Processing [0.8702432681310401]
This thesis leverages the DisCoCat model to design a quantum-based kernel function that can be used by a support vector machine (SVM) for NLP tasks.
Two similarity measures were studied: (i) the transition amplitude approach and (ii) the SWAP test.
The explicit model from previous work was used to train word embeddings and achieved a testing accuracy of $93.09 \pm 0.01\%$.
arXiv Detail & Related papers (2022-05-13T00:45:46Z)