TLP: A Deep Learning-based Cost Model for Tensor Program Tuning
- URL: http://arxiv.org/abs/2211.03578v1
- Date: Mon, 7 Nov 2022 14:11:43 GMT
- Title: TLP: A Deep Learning-based Cost Model for Tensor Program Tuning
- Authors: Yi Zhai, Yu Zhang, Shuo Liu, Xiaomeng Chu, Jie Peng, Jianmin Ji,
Yanyong Zhang
- Abstract summary: We propose TLP, a deep learning-based cost model that facilitates tensor program tuning.
We show that TLP can speed up the average search time by 9.1X on CPU workloads.
We incorporate these techniques into the Ansor framework and conduct detailed experiments.
- Score: 15.841139749937351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tensor program tuning is a non-convex objective optimization problem, to
which search-based approaches have proven to be effective. At the core of the
search-based approaches lies the design of the cost model. Though deep
learning-based cost models perform significantly better than other methods,
they still fall short and suffer from the following problems. First, their
feature extraction heavily relies on expert-level domain knowledge in hardware
architectures. Even so, the extracted features are often unsatisfactory and
require separate considerations for CPUs and GPUs. Second, a cost model trained
on one hardware platform usually performs poorly on another, a problem we call
cross-hardware unavailability.
In order to address these problems, we propose TLP and MTL-TLP. TLP is a deep
learning-based cost model that facilitates tensor program tuning. Instead of
extracting features from the tensor program itself, TLP extracts features from
the schedule primitives. We treat schedule primitives as tensor languages. TLP
is thus a Tensor Language Processing task. In this way, the task of predicting
the tensor program latency through the cost model is transformed into a natural
language processing (NLP) regression task. MTL-TLP combines Multi-Task Learning
and TLP to cope with the cross-hardware unavailability problem.
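As a rough illustration of this idea (a minimal sketch, not the authors' released code), the schedule primitives of a candidate tensor program can be tokenized into a sequence and regressed to a latency value with a standard sequence encoder; for MTL-TLP, a shared encoder feeds one regression head per hardware platform. The vocabulary size, model dimensions, and per-hardware head layout below are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the TLP/MTL-TLP idea described above (assumptions, not the
# authors' code): schedule primitives are treated as a token sequence and
# latency prediction becomes a sequence-regression task.
import torch
import torch.nn as nn


class TLPRegressor(nn.Module):
    def __init__(self, vocab_size=512, d_model=256, n_heads=8,
                 n_layers=4, hardware_targets=("cpu", "gpu")):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # MTL-TLP: one shared encoder, one regression head per hardware target,
        # so abundant data from one platform can help a data-scarce platform.
        self.heads = nn.ModuleDict(
            {hw: nn.Linear(d_model, 1) for hw in hardware_targets})

    def forward(self, primitive_token_ids, hardware="cpu"):
        # primitive_token_ids: (batch, seq_len) integer ids of schedule primitives
        x = self.embed(primitive_token_ids)
        x = self.encoder(x)        # contextualize the primitive sequence
        x = x.mean(dim=1)          # pool over the sequence
        return self.heads[hardware](x).squeeze(-1)  # predicted latency score


# Example: score a batch of 8 candidate schedules, each encoded as 64 tokens.
model = TLPRegressor()
tokens = torch.randint(0, 512, (8, 64))
predicted_latency = model(tokens, hardware="gpu")
```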
We incorporate these techniques into the Ansor framework and conduct detailed
experiments. Results show that TLP can speed up the average search time by 9.1X
and 3.0X on CPU and GPU workloads, respectively, compared to the
state-of-the-art implementation. MTL-TLP can achieve a speed-up of 4.7X and
2.9X on CPU and GPU workloads, respectively, using only 7% of the target
hardware data.
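For context, a hedged sketch of how such a cost model typically drives a search-based tuner like Ansor: the model cheaply ranks many candidate schedules, and only the most promising few are measured on real hardware and fed back as training data. The helpers generate_candidates, encode_primitives, and measure_on_hardware are hypothetical stand-ins for the tuner's internals, not Ansor APIs.

```python
# Hedged sketch of cost-model-guided search (hypothetical helpers, not Ansor's API).
import torch


def search_step(model, task, n_candidates=1000, top_k=10, hardware="cpu"):
    candidates = generate_candidates(task, n_candidates)   # sample/mutate schedules
    tokens = torch.stack([encode_primitives(c) for c in candidates])  # fixed-length token ids
    with torch.no_grad():
        scores = model(tokens, hardware=hardware)           # predicted latency per candidate
    best = scores.argsort()[:top_k].tolist()                # lowest predicted latency first
    # Only the top-k candidates are measured on real hardware; the measurements
    # can then be used to refine the cost model.
    return [(candidates[i], measure_on_hardware(candidates[i])) for i in best]
```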
Related papers
- QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach [25.521351239401287]
Heterogeneous deep learning systems (DLS) have been widely deployed in industrial data centers.
We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating programs across DLS.
As a result, the programming of DLS is improved by up to 9x via transcompiling legacy programs.
arXiv Detail & Related papers (2025-05-04T15:14:27Z) - Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference.
This paper presents the first comprehensive study on the prevalent issue of overthinking in these models.
We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z) - Parameter-Efficient Transfer Learning for Music Foundation Models [51.61531917413708]
We investigate the use of parameter-efficient transfer learning (PETL) for music foundation models.
PETL methods outperform both probing and fine-tuning on music auto-tagging.
PETL methods achieve similar results as fine-tuning with significantly less training cost.
arXiv Detail & Related papers (2024-11-28T20:50:40Z) - FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers [6.194917248699324]
This paper proposes a new technique for deep learning compilers called FTuner.
Experiments show that FTuner can achieve operator and end-to-end performance comparable to vendor libraries.
arXiv Detail & Related papers (2024-07-31T08:05:33Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Pruner: A Speculative Exploration Mechanism to Accelerate Tensor Program Tuning [9.730351520714699]
Pruner and MoA-Pruner are proposed to speed up program tuning for deep neural networks.
Pruner is a speculative exploration mechanism that accelerates the search process using a "Draft-then-Verify" paradigm.
MoA-Pruner introduces Momentum online Adaptation to address the cross-platform online unawareness.
arXiv Detail & Related papers (2024-02-04T06:11:12Z) - Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z) - Improving Representational Continuity via Continued Pretraining [76.29171039601948]
A method from the transfer learning community (LP-FT) outperforms naive training and other continual learning methods.
LP-FT also reduces forgetting in a real world satellite remote sensing dataset (FMoW)
variant of LP-FT gets state-of-the-art accuracies on an NLP continual learning benchmark.
arXiv Detail & Related papers (2023-02-26T10:39:38Z) - Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which in contrast optimizes task-specific decoder networks on the output side.
By gradient-based optimization, DecT can be trained within several seconds and requires only one PLM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z) - PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PaL) to understand natural language problems.
PaL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z) - Learning to Optimize Permutation Flow Shop Scheduling via Graph-based
Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z) - Compressing And Debiasing Vision-Language Pre-Trained Models for Visual
Question Answering [25.540831728925557]
This paper investigates whether a vision-language pre-trained model can be compressed and debiased simultaneously by searching for sparse and robust subnetworks.
Our results show that there indeed exist sparse and robust subnetworks, which are competitive with the debiased full model.
arXiv Detail & Related papers (2022-10-26T08:25:03Z) - Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor
Programs [11.338285393619042]
We propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering.
With the proposed paradigm, we implement a deep learning compiler - Hidet.
arXiv Detail & Related papers (2022-10-18T05:32:13Z) - Design and Implementation of a Quantum Kernel for Natural Language
Processing [0.8702432681310401]
This thesis leverages the DisCoCat model to design a quantum-based kernel function that can be used by a support vector machine (SVM) for NLP tasks.
Two similarity measures were studied: (i) the transition amplitude approach and (ii) the SWAP test.
The explicit model from previous work was used to train word embeddings and achieved a testing accuracy of $93.09 \pm 0.01\%$.
arXiv Detail & Related papers (2022-05-13T00:45:46Z)