Pruner: A Speculative Exploration Mechanism to Accelerate Tensor Program Tuning
- URL: http://arxiv.org/abs/2402.02361v2
- Date: Sat, 29 Jun 2024 12:57:39 GMT
- Title: Pruner: A Speculative Exploration Mechanism to Accelerate Tensor Program Tuning
- Authors: Liang Qiao, Jun Shi, Xiaoyu Hao, Xi Fang, Minfan Zhao, Ziqi Zhu, Junshi Chen, Hong An, Bing Li, Honghui Yuan, Xinyang Wang, Xulong Tang
- Abstract summary: Pruner and MoA-Pruner are proposed to speed up program tuning for deep neural networks.
Pruner is a speculative exploration mechanism that accelerates the search process using a "Draft-then-Verify" paradigm.
MoA-Pruner introduces Momentum online Adaptation to address the cross-platform online unawareness.
- Score: 9.730351520714699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tensor program tuning is essential for the efficient deployment of deep neural networks. Search-based approaches have demonstrated scalability and effectiveness in automatically finding high-performance programs for specific hardware. However, the search process is often inefficient, taking hours or even days to discover optimal programs due to the exploration mechanisms guided by an accurate but slow learned cost model. Meanwhile, the learned cost model trained on one platform cannot seamlessly adapt online to another, which we call cross-platform online unawareness. In this work, we propose Pruner and MoA-Pruner. Pruner is a speculative exploration mechanism that accelerates the search process using a "Draft-then-Verify" paradigm. Instead of applying the complex learned cost model to all explored candidates, Pruner drafts small-scale speculative candidates by introducing a naive symbol analyzer (draft model), then identifies the best candidates with the learned cost model. MoA-Pruner introduces Momentum online Adaptation to address the cross-platform online unawareness. We incorporate these techniques into Ansor and conduct extensive experiments on three GPU-based platforms. Results show that in online cost model tuning scenarios, Pruner and MoA-Pruner can achieve an average speedup of $2.6 \times$ and $4.82 \times$ compared to Ansor. In offline tuning scenarios, Pruner can achieve an average speedup of $4.75 \times$ and $4.05 \times$ compared to TenSet and TLP, respectively. The code is available at https://github.com/qiaolian9/Pruner.
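The "Draft-then-Verify" loop is easy to see in miniature. Below is a minimal Python sketch under stated assumptions: the function names, the proxy features, and the candidate encoding are hypothetical stand-ins, not the paper's actual API (the real draft model is a symbol analyzer that reasons about schedule structure).

```python
import random

def draft_score(candidate):
    # Hypothetical cheap "symbol analyzer" proxy: scores a candidate from
    # static features (here, estimated parallelism vs. memory traffic)
    # without invoking the learned cost model.
    return candidate["parallelism"] / (1.0 + candidate["mem_traffic"])

def learned_cost(candidates):
    # Stand-in for the slow but accurate learned cost model (batched).
    return [c["true_latency"] for c in candidates]

def draft_then_verify(population, draft_keep=64, verify_keep=8):
    # Draft: rank the whole population with the cheap proxy and keep only
    # a small speculative subset.
    drafted = sorted(population, key=draft_score, reverse=True)[:draft_keep]
    # Verify: run the expensive learned cost model on the drafted subset
    # only, and keep the best candidates for hardware measurement.
    costs = learned_cost(drafted)
    order = sorted(range(len(drafted)), key=lambda i: costs[i])
    return [drafted[i] for i in order[:verify_keep]]

# Toy usage: 10,000 random "schedules"; only 64 ever reach the learned model.
population = [{"parallelism": random.random(),
               "mem_traffic": random.random(),
               "true_latency": random.random()} for _ in range(10_000)]
print(len(draft_then_verify(population)), "candidates kept for measurement")
```

The speedup comes from the asymmetry: the proxy is evaluated on everything, while the learned cost model only sees the drafted survivors.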
Related papers
- SPECS: Faster Test-Time Scaling through Speculative Drafts [55.231201692232894]
SPECS is a latency-aware test-time scaling method inspired by speculative decoding.
Our results show that SPECS matches or surpasses beam search accuracy while reducing latency by up to ~19.1%.
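As a rough illustration of the speculative-drafting pattern SPECS builds on (not the paper's actual algorithm; every name below is a hypothetical stand-in), a small draft model can expand a beam while the expensive model only ranks the drafted pool:

```python
import numpy as np

rng = np.random.default_rng(0)

def draft_propose(prefix, k):
    # Hypothetical small draft model: cheaply proposes k candidate
    # continuations of a reasoning prefix.
    return [prefix + [int(rng.integers(0, 100))] for _ in range(k)]

def target_score(seqs):
    # Hypothetical large target/reward model: expensive but accurate,
    # called once per step on the drafted batch instead of on every
    # expansion the search could have made.
    return [float(rng.random()) for _ in seqs]

def speculative_beam_step(beams, k=4, beam_width=2):
    pool = [c for b in beams for c in draft_propose(b, k)]
    scores = target_score(pool)
    order = np.argsort(scores)[::-1][:beam_width]
    return [pool[i] for i in order]

beams = [[0]]
for _ in range(3):
    beams = speculative_beam_step(beams)
print(beams)
```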
arXiv Detail & Related papers (2025-06-15T05:50:05Z)
- ETS: Efficient Tree Search for Inference-Time Scaling [61.553681244572914]
One promising approach for test-time compute scaling is search against a process reward model.
The diversity of trajectories in the tree search process affects the accuracy of the search, since increasing diversity promotes more exploration.
We propose Efficient Tree Search (ETS), which promotes KV sharing by pruning redundant trajectories while maintaining necessary diverse trajectories.
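One way to make that trade-off concrete (a toy sketch, not ETS's actual objective): greedily keep trajectories whose score justifies the new KV-cache blocks they introduce, so prefix-sharing candidates are nearly free while genuinely diverse ones must earn their extra memory.

```python
def kv_aware_select(trajs, scores, keep, kv_weight=0.05):
    # Greedy selection trading off reward-model score against the memory
    # cost of *new* KV-cache entries: a trajectory sharing a long prefix
    # with an already-kept one reuses that prefix's KV cache.
    def shared_prefix(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    kept = []
    while len(kept) < keep:
        cands = [i for i in range(len(trajs)) if i not in kept]
        if not cands:
            break
        def utility(i):
            reuse = max((shared_prefix(trajs[i], trajs[j]) for j in kept),
                        default=0)
            new_kv = len(trajs[i]) - reuse   # tokens needing fresh KV cache
            return scores[i] - kv_weight * new_kv
        kept.append(max(cands, key=utility))
    return [trajs[i] for i in kept]

trajs = [[1, 2, 3], [1, 2, 4], [7, 8, 9], [1, 2, 3, 5]]
scores = [0.9, 0.85, 0.6, 0.95]
print(kv_aware_select(trajs, scores, keep=3))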
arXiv Detail & Related papers (2025-02-19T09:30:38Z)
- A hybrid framework for effective and efficient machine unlearning [12.499101994047862]
Machine unlearning (MU) is proposed to remove the imprints of revoked samples from the already trained model parameters.
We present a novel hybrid strategy on top of existing MU methods to achieve overall success.
arXiv Detail & Related papers (2024-12-19T03:59:26Z)
- Reducing Hyperparameter Tuning Costs in ML, Vision and Language Model Training Pipelines via Memoization-Awareness [19.24063761779741]
We propose a "memoization-aware" Bayesian Optimization (BO) algorithm, EEIPU, that works in tandem with a pipeline caching system.
In our benchmarks on machine learning (model ensembles), vision (convolutional architecture) and language (T5 architecture) pipelines, we compare EEIPU against recent BO algorithms.
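A toy sketch of the caching side of this idea, with hypothetical stage names (EEIPU additionally uses an expected-improvement-per-unit-cost acquisition over surrogate models, which is omitted here):

```python
cache = {}

def run_stage(name, params, cost):
    # Memoize stage outputs on (stage, upstream hyperparameters): a new
    # candidate sharing a prefix with a cached one skips those stages and
    # pays nothing for them.
    key = (name, params)
    if key in cache:
        return cache[key], 0.0            # cache hit: no cost incurred
    out = sum(params) % 7 / 7.0           # stand-in for real stage work
    cache[key] = out
    return out, cost

def evaluate(a, b, c):
    # A three-stage pipeline: each stage depends only on the
    # hyperparameters at or before it, which is what makes prefix
    # caching sound.
    x, c1 = run_stage("prep", (a,), cost=10.0)
    y, c2 = run_stage("train", (a, b), cost=100.0)
    z, c3 = run_stage("finetune", (a, b, c), cost=25.0)
    return x + y + z, c1 + c2 + c3        # (score, cost actually paid)

print(evaluate(1, 2, 3))   # pays 135.0: everything is cold
print(evaluate(1, 2, 9))   # shares the (a, b) prefix: pays only 25.0
```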
arXiv Detail & Related papers (2024-11-06T07:53:04Z)
- Finding Transformer Circuits with Edge Pruning [71.12127707678961]
We propose Edge Pruning as an effective and scalable solution to automated circuit discovery.
Our method finds circuits in GPT-2 that use less than half the number of edges compared to circuits found by previous methods.
Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale that prior methods operate on.
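As a simplified illustration (Edge Pruning itself learns differentiable edge masks with gradient-based optimization, which is what lets it scale), greedy edge ablation on a toy model conveys the shared objective: a sparse edge set that stays faithful to the full model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": the output is a sum over edges of weight * signal; each
# edge can be switched off by a 0/1 mask entry.
n_edges = 12
weights = rng.normal(size=n_edges)
signal = rng.normal(size=n_edges)

def forward(mask):
    return float(np.dot(weights * mask, signal))

full_out = forward(np.ones(n_edges))
budget = 0.05 * abs(full_out)   # allowed deviation from the full model

# Greedy ablation: repeatedly drop the edge whose removal perturbs the
# output least, stopping when the faithfulness budget would be exceeded.
mask = np.ones(n_edges)
while mask.sum() > 0:
    best, best_dev = None, None
    for i in range(n_edges):
        if mask[i] == 0:
            continue
        trial = mask.copy()
        trial[i] = 0
        dev = abs(full_out - forward(trial))
        if best is None or dev < best_dev:
            best, best_dev = i, dev
    if best is None or best_dev > budget:
        break
    mask[best] = 0

print("circuit edges kept:", np.flatnonzero(mask))
```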
arXiv Detail & Related papers (2024-06-24T16:40:54Z)
- Graspness Discovery in Clutters for Fast and Accurate Grasp Detection [57.81325062171676]
"graspness" is a quality based on geometry cues that distinguishes graspable areas in cluttered scenes.
We develop a neural network named cascaded graspness model to approximate the searching process.
Experiments on a large-scale benchmark, GraspNet-1Billion, show that our method outperforms previous arts by a large margin.
arXiv Detail & Related papers (2024-06-17T02:06:47Z)
- FootstepNet: an Efficient Actor-Critic Method for Fast On-line Bipedal Footstep Planning and Forecasting [0.0]
We propose an efficient footstep planning method to navigate in local environments with obstacles.
We also propose a forecasting method that quickly estimates the number of footsteps required to reach different candidate local targets.
We demonstrate the validity of our approach with simulation results, and by a deployment on a kid-size humanoid robot during the RoboCup 2023 competition.
arXiv Detail & Related papers (2024-03-19T09:48:18Z)
- Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling [40.535672813968375]
Safety of Large Language Models (LLMs) has become a critical issue given their rapid progress.
We study a new algorithm called Probe sampling to reduce the time cost of GCG.
Probe sampling is also able to accelerate other prompt optimization techniques and adversarial methods.
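A minimal sketch of the probe-sampling idea under stated assumptions (all names are hypothetical; the paper measures agreement with Spearman rank correlation and targets GCG's candidate suffixes, while this sketch uses Pearson correlation on generic candidates):

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_sampling(candidates, cheap_loss, target_loss,
                   probe_frac=0.1, base_keep=0.25):
    # 1) Score every candidate with the small draft model.
    cheap = np.array([cheap_loss(c) for c in candidates])
    # 2) Score a small random probe set with the expensive target model
    #    and measure how well the two models agree on it.
    n = len(candidates)
    probe = rng.choice(n, size=max(2, int(probe_frac * n)), replace=False)
    tgt = np.array([target_loss(candidates[i]) for i in probe])
    agreement = np.corrcoef(cheap[probe], tgt)[0, 1]
    # 3) The stronger the agreement, the more aggressively we filter.
    keep = max(1, int(n * base_keep * (1.0 - max(agreement, 0.0))))
    filtered = np.argsort(cheap)[:keep]
    # 4) Only the filtered set pays for full target-model evaluation.
    evaluated = [(target_loss(candidates[i]), i) for i in filtered]
    return candidates[min(evaluated)[1]]

# Toy usage with synthetic, correlated draft/target losses.
true_loss = rng.normal(size=200)
best = probe_sampling(list(range(200)),
                      cheap_loss=lambda c: true_loss[c] + 0.3 * rng.normal(),
                      target_loss=lambda c: float(true_loss[c]))
print("selected candidate:", best)
```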
arXiv Detail & Related papers (2024-03-02T16:23:44Z)
- Efficient Verification-Based Face Identification [50.616875565173274]
We study the problem of performing face verification with an efficient neural model $f$.
Our model leads to a substantially smaller $f$, requiring only 23k parameters and 5M floating-point operations (FLOPs).
We use six face verification datasets to demonstrate that our method is on par or better than state-of-the-art models.
arXiv Detail & Related papers (2023-12-20T18:08:02Z)
- Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training [0.0]
We propose a practical method for robustly tuning large models.
CARBS performs local search around the performance-cost frontier.
Among our results, we effectively solve the entire ProcGen benchmark just by tuning a simple baseline.
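A toy sketch of frontier-local proposal generation in this spirit (hypothetical names; the real CARBS also fits surrogate models of both performance and cost, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def pareto_front(obs):
    # obs: list of (params, cost, performance). A point stays on the
    # frontier if no other observation is both cheaper and better.
    front = []
    for i, (p, c, perf) in enumerate(obs):
        dominated = any(c2 <= c and perf2 >= perf and j != i
                        for j, (_, c2, perf2) in enumerate(obs))
        if not dominated:
            front.append((p, c, perf))
    return front

def propose(obs, n_samples=8, sigma=0.2):
    # Local search around the cost-performance frontier: perturb the
    # hyperparameters of randomly chosen frontier members, so new trials
    # concentrate where compute is already being spent well.
    front = pareto_front(obs)
    picks = [front[int(rng.integers(len(front)))] for _ in range(n_samples)]
    return [np.asarray(p) + sigma * rng.normal(size=len(p))
            for p, _, _ in picks]

obs = [(np.array([0.1, 1.0]), 10.0, 0.70),
       (np.array([0.3, 2.0]), 30.0, 0.80),
       (np.array([0.2, 1.5]), 35.0, 0.75)]   # dominated by the second
print(propose(obs, n_samples=2))
```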
arXiv Detail & Related papers (2023-06-13T18:22:24Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
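The sampling step has a compact form: a stochastic-gradient Langevin update $\theta \leftarrow \theta - \eta \nabla L(\theta) + \sqrt{2\eta}\,\epsilon$. A toy sketch for a linear Q-function (the names and the regression loss are illustrative, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_q_sample(theta, grad_loss, steps=200, eta=1e-3):
    # Noisy gradient descent = Langevin Monte Carlo: the injected
    # Gaussian noise turns the optimizer into an approximate sampler
    # from the posterior over Q-function parameters.
    for _ in range(steps):
        theta = (theta - eta * grad_loss(theta)
                 + np.sqrt(2.0 * eta) * rng.normal(size=theta.shape))
    return theta

# Toy linear Q-function on (state, action) features: gradient of the
# squared regression loss 0.5 * ||X @ theta - y||^2.
X = rng.normal(size=(64, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=64)
theta_sample = langevin_q_sample(np.zeros(4), lambda t: X.T @ (X @ t - y))

# Thompson sampling: act greedily w.r.t. this one posterior sample; the
# randomness across samples drives exploration without explicit bonuses.
actions = rng.normal(size=(5, 4))   # candidate action features
print("greedy action:", int(np.argmax(actions @ theta_sample)))
```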
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- TLP: A Deep Learning-based Cost Model for Tensor Program Tuning [15.841139749937351]
We propose TLP, a deep learning-based cost model that facilitates tensor program tuning.
We show that TLP can speed up the average search time by 9.1× on CPU workloads.
We incorporate these techniques into the Ansor framework and conduct detailed experiments.
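A toy sketch of the featurization-plus-ranking idea behind learned cost models like TLP (hypothetical primitive vocabulary, with a mean-pooled linear model standing in for TLP's deeper sequence model):

```python
import numpy as np

rng = np.random.default_rng(0)

# TLP-style featurization: treat the schedule as a sequence of primitive
# tokens rather than hand-engineered performance counters.
VOCAB = {"split": 0, "reorder": 1, "unroll": 2, "vectorize": 3, "parallel": 4}
EMB = rng.normal(size=(len(VOCAB), 16))

def featurize(schedule):
    # Embed each primitive and mean-pool; this is just the shape of the idea.
    toks = np.array([VOCAB[p] for p in schedule])
    return EMB[toks].mean(axis=0)

W = rng.normal(size=16)

def predicted_cost(schedule):
    return float(featurize(schedule) @ W)

def pairwise_rank_loss(fast, slow, margin=1.0):
    # Cost models for tuning only need the correct *ordering* of
    # candidates, so a ranking loss is a natural training objective.
    return max(0.0, margin + predicted_cost(fast) - predicted_cost(slow))

print(pairwise_rank_loss(["split", "parallel"], ["split", "unroll"]))
```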
arXiv Detail & Related papers (2022-11-07T14:11:43Z)
- Late Prompt Tuning: A Late Prompt Could Be Better Than Many Prompts [97.20933523766182]
Prompt tuning is a parameter-efficient tuning (PETuning) method for utilizing pre-trained models (PTMs).
We present Late Prompt Tuning (LPT), which inserts a late prompt into an intermediate layer of the PTM instead of the input layer or all layers.
We show that LPT can achieve competitive performance to full model tuning and other PETuning methods under both full-data and few-shot scenarios.
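A forward-pass-only sketch of the insertion point (toy dense layers stand in for a frozen PTM; the real LPT also uses a prompt generator and back-propagates only through the upper layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, prompt_len, n_layers, insert_at = 32, 4, 12, 6

layers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
late_prompt = rng.normal(size=(prompt_len, d))  # the only trained tensor

def forward(token_states):
    h = token_states
    for i, W in enumerate(layers):
        if i == insert_at:
            # LPT-style insertion: the trainable prompt joins the hidden
            # sequence at an intermediate layer, so the lower half of the
            # frozen network never sees (or back-propagates through) it.
            h = np.concatenate([late_prompt, h], axis=0)
        h = np.tanh(h @ W)
    return h

out = forward(rng.normal(size=(10, d)))
print(out.shape)  # (prompt_len + 10, d)
```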
arXiv Detail & Related papers (2022-10-20T14:23:52Z)
- Efficient Offline Policy Optimization with a Learned Model [83.64779942889916]
MuZero Unplugged presents a promising approach for offline policy learning from logged data.
It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data.
This paper investigates a few hypotheses where MuZero Unplugged may not work well under the offline settings.
arXiv Detail & Related papers (2022-10-12T07:41:04Z)
- APP: Anytime Progressive Pruning [104.36308667437397]
We propose a novel way of training a neural network with a target sparsity in a particular case of online learning: the anytime learning at macroscale paradigm (ALMA).
The proposed approach significantly outperforms the baseline dense and Anytime OSP models across multiple architectures and datasets under short, moderate, and long-sequence training.
arXiv Detail & Related papers (2022-04-04T16:38:55Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
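A minimal sketch of layer-wise neighborhood sampling, the core of such a sampler (a toy adjacency-list graph; the engineered version batches and pipelines this on GPUs):

```python
import random

def sample_neighborhood(adj, seeds, fanouts):
    # Layer-wise neighbor sampling: for each GNN layer, keep at most
    # `fanout` random neighbors per node, bounding both the compute and
    # the host-to-GPU data movement of a mini-batch.
    layers, frontier = [], set(seeds)
    for fanout in fanouts:
        edges, next_frontier = [], set()
        for v in frontier:
            nbrs = adj.get(v, [])
            for u in random.sample(nbrs, min(fanout, len(nbrs))):
                edges.append((u, v))
                next_frontier.add(u)
        layers.append(edges)
        # A node also needs its own features at the next layer.
        frontier = next_frontier | frontier
    return layers

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(sample_neighborhood(adj, seeds=[0], fanouts=[2, 2]))
```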
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
- One-Cycle Pruning: Pruning ConvNets Under a Tight Training Budget [0.0]
Introducing sparsity in a neural network has been an efficient way to reduce its complexity while keeping its performance almost intact.
Most of the time, sparsity is introduced using a three-stage pipeline: 1) train the model to convergence, 2) prune the model according to some criterion, 3) fine-tune the pruned model to recover performance.
In our work, we propose to get rid of the first step of the pipeline and to combine the two other steps in a single pruning-training cycle.
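A toy sketch of the resulting single cycle (the sparsity schedule below is illustrative, not the paper's exact one; the point is that pruning interleaves with the one and only training run):

```python
import numpy as np

def one_cycle_sparsity(step, total_steps, final_sparsity=0.9):
    # Sparsity ramps up smoothly over the single training run; by the end
    # the target sparsity is reached, with no separate train-to-convergence
    # or fine-tune phase.
    t = step / total_steps
    return final_sparsity * (1 - (1 - t) ** 3)   # cubic ramp, one choice of many

def magnitude_mask(weights, sparsity):
    # Keep the largest-magnitude weights, zero the rest.
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights)
    thresh = np.partition(np.abs(weights).ravel(), k)[k]
    return (np.abs(weights) >= thresh).astype(weights.dtype)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for step in range(0, 1001, 250):
    mask = magnitude_mask(w, one_cycle_sparsity(step, 1000))
    w_pruned = w * mask   # in training, gradient updates follow each prune
    print(step, round(1 - mask.mean(), 3))
```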
arXiv Detail & Related papers (2021-07-05T15:27:07Z)
- Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z)
- Ansor: Generating High-Performance Tensor Programs for Deep Learning [45.437816016043534]
We present Ansor, a tensor program generation framework for deep learning applications.
Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space.
Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches.
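A toy sketch of the two levels of that hierarchy (hypothetical sketch/annotation vocabulary, and a stand-in cost model in place of Ansor's learned one):

```python
import random

random.seed(0)

SKETCHES = ["tile3", "tile3+fuse", "tile2+cache_read"]   # high-level structures

def sample_program():
    # Hierarchical generation: pick a structural sketch first, then fill
    # in low-level annotations (tile sizes, unroll factors) at random.
    return {"sketch": random.choice(SKETCHES),
            "tiles": [random.choice([1, 2, 4, 8, 16]) for _ in range(3)],
            "unroll": random.choice([0, 16, 64, 512])}

def mutate(prog):
    child = {**prog, "tiles": list(prog["tiles"])}
    child["tiles"][random.randrange(3)] = random.choice([1, 2, 4, 8, 16])
    return child

def cost_model(prog):
    # Stand-in for the learned cost model.
    return abs(prog["tiles"][0] * prog["tiles"][1] - 32) + prog["tiles"][2]

def evolutionary_search(rounds=10, pop=32, keep=8):
    population = [sample_program() for _ in range(pop)]
    for _ in range(rounds):
        population.sort(key=cost_model)
        parents = population[:keep]
        population = parents + [mutate(random.choice(parents))
                                for _ in range(pop - keep)]
    return min(population, key=cost_model)

print(evolutionary_search())
```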
arXiv Detail & Related papers (2020-06-11T19:40:09Z)