SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance
- URL: http://arxiv.org/abs/2209.00625v1
- Date: Tue, 30 Aug 2022 03:05:56 GMT
- Title: SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance
- Authors: Li Lyna Zhang, Youkow Homma, Yujing Wang, Min Wu, Mao Yang, Ruofei
Zhang, Ting Cao, Wei Shen
- Abstract summary: This work aims to design a new, low-latency BERT via structured pruning to empower real-time online inference for cold start ads relevance on a CPU platform.
In this paper, we propose SwiftPruner - an efficient framework that leverages evolution-based search to automatically find the best-performing layer-wise sparse BERT model.
- Score: 19.930169700686672
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ad relevance modeling plays a critical role in online advertising systems
including Microsoft Bing. To leverage powerful transformers like BERT in this
low-latency setting, many existing approaches perform ad-side computations
offline. While efficient, these approaches are unable to serve cold start ads,
resulting in poor relevance predictions for such ads. This work aims to design
a new, low-latency BERT via structured pruning to empower real-time online
inference for cold start ads relevance on a CPU platform. Our challenge is that
previous methods typically prune all layers of the transformer to a high,
uniform sparsity, thereby producing models which cannot achieve satisfactory
inference speed with acceptable accuracy.
In this paper, we propose SwiftPruner - an efficient framework that leverages
evolution-based search to automatically find the best-performing layer-wise
sparse BERT model under the desired latency constraint. Different from existing
evolution algorithms that conduct random mutations, we propose a reinforced
mutator with a latency-aware multi-objective reward to conduct better mutations
for efficiently searching the large space of layer-wise sparse models.
Extensive experiments demonstrate that our method consistently achieves higher
ROC AUC and lower latency than the uniform sparse baseline and state-of-the-art
search methods. Remarkably, under our latency requirement of 1900 μs on CPU,
SwiftPruner achieves a 0.86% higher AUC than the state-of-the-art uniform
sparse baseline for BERT-Mini on a large-scale real-world dataset. Online A/B
testing shows that our model also achieves a significant 11.7% cut in the ratio
of defective cold start ads with satisfactory real-time serving latency.
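The abstract states the recipe, an evolutionary search over layer-wise sparsity assignments steered by a reinforced mutator with a latency-aware multi-objective reward, but not its algorithmic details. The following is a minimal, self-contained sketch of that idea, not the authors' implementation: the sparsity grid, the reward shape, and the estimate_auc / estimate_latency proxies are placeholder assumptions for illustration only.

```python
import math
import random

# Candidate = one sparsity level per transformer layer (BERT-Mini has 4 layers).
SPARSITY_LEVELS = [0.0, 0.25, 0.5, 0.75, 0.9]    # assumed search grid, not from the paper
NUM_LAYERS = 4
LATENCY_BUDGET_US = 1900.0                        # latency requirement quoted in the abstract

def estimate_latency(cand):
    """Placeholder latency proxy: denser layers cost more. A real system would profile on CPU."""
    return 200.0 + sum(600.0 * (1.0 - s) for s in cand)

def estimate_auc(cand):
    """Placeholder accuracy proxy: more pruning hurts, with a per-layer weighting."""
    return 0.90 - sum(0.02 * s * (i + 1) / NUM_LAYERS for i, s in enumerate(cand))

def reward(cand):
    """Latency-aware multi-objective reward: the AUC proxy is scaled down when over budget."""
    auc, lat = estimate_auc(cand), estimate_latency(cand)
    return auc * min(1.0, LATENCY_BUDGET_US / lat) ** 2.0    # assumed penalty shape

# Reinforced mutator: one categorical distribution (logits) per layer, updated with a
# REINFORCE-style rule so mutations are not drawn uniformly at random.
logits = [[0.0] * len(SPARSITY_LEVELS) for _ in range(NUM_LAYERS)]

def sample_mutation(parent):
    child = list(parent)
    layer = random.randrange(NUM_LAYERS)                     # mutate a single layer per step
    weights = [math.exp(l) for l in logits[layer]]
    idx = random.choices(range(len(SPARSITY_LEVELS)), weights=weights)[0]
    child[layer] = SPARSITY_LEVELS[idx]
    return child, (layer, idx)

def reinforce_update(pick, advantage, lr=0.5):
    layer, idx = pick
    weights = [math.exp(l) for l in logits[layer]]
    total = sum(weights)
    for j in range(len(SPARSITY_LEVELS)):
        grad = (1.0 if j == idx else 0.0) - weights[j] / total   # d log p(idx) / d logit_j
        logits[layer][j] += lr * advantage * grad

# Evolutionary loop: keep the fittest candidates and mutate them with the learned mutator.
population = [[random.choice(SPARSITY_LEVELS) for _ in range(NUM_LAYERS)] for _ in range(16)]
baseline = sum(reward(c) for c in population) / len(population)
for step in range(200):
    population.sort(key=reward, reverse=True)
    parent = random.choice(population[:4])
    child, pick = sample_mutation(parent)
    r = reward(child)
    reinforce_update(pick, advantage=r - baseline)
    baseline = 0.9 * baseline + 0.1 * r                      # moving-average reward baseline
    population[-1] = child                                   # replace the weakest member

best = max(population, key=reward)
print("best layer-wise sparsity:", best)
print("AUC proxy %.4f, latency proxy %.0f us" % (estimate_auc(best), estimate_latency(best)))
```

Consistent with the abstract's description, the mutator here is learned rather than uniformly random: the REINFORCE update shifts the per-layer sampling distribution toward sparsity choices that score well under the latency-aware reward, so the search spends its evaluations on candidates likely to meet the 1900 μs budget.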
Related papers
- FORA: Fast-Forward Caching in Diffusion Transformer Acceleration [39.51519525071639]
Diffusion transformers (DiT) have become the de facto choice for generating high-quality images and videos.
Fast-FORward CAching (FORA) is designed to accelerate DiT by exploiting the repetitive nature of the diffusion process. (A generic caching sketch appears after this list.)
arXiv Detail & Related papers (2024-07-01T16:14:37Z)
- PUMA: margin-based data pruning [51.12154122266251]
We focus on data pruning, where some training samples are removed based on their distance to the model's classification boundary (i.e., margin).
We propose PUMA, a new data pruning strategy that computes the margin using DeepFool.
We show that PUMA can be used on top of the current state-of-the-art robustness methodology and, unlike existing data pruning strategies, significantly improves model performance.
arXiv Detail & Related papers (2024-05-10T08:02:20Z)
- ETuner: A Redundancy-Aware Framework for Efficient Continual Learning Application on Edge Devices [47.365775210055396]
We propose ETuner, an efficient edge continual learning framework that optimizes inference accuracy, fine-tuning execution time, and energy efficiency.
Experimental results show that, on average, ETuner reduces overall fine-tuning execution time by 64%, energy consumption by 56%, and improves average inference accuracy by 1.75% over the immediate model fine-tuning approach.
arXiv Detail & Related papers (2024-01-30T02:41:05Z)
- Efficient Architecture Search via Bi-level Data Pruning [70.29970746807882]
This work pioneers an exploration into the critical role of dataset characteristics for DARTS bi-level optimization.
We introduce a new progressive data pruning strategy that utilizes supernet prediction dynamics as the metric.
Comprehensive evaluations on the NAS-Bench-201 search space, DARTS search space, and MobileNet-like search space validate that the proposed Bi-level Data Pruning (BDP) reduces search costs by over 50%.
arXiv Detail & Related papers (2023-12-21T02:48:44Z)
- E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
- Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference [18.308180927492643]
ToP is a ranking-distilled token distillation technique, which distills effective token rankings from the final layer of unpruned models to early layers of pruned models.
ToP reduces the average FLOPs of BERT by 8.1x while achieving competitive accuracy on GLUE, and provides a real latency speedup of up to 7.4x on an Intel CPU. (A generic token-pruning-with-ranking-distillation sketch appears after this list.)
arXiv Detail & Related papers (2023-06-26T03:06:57Z)
- COPR: Consistency-Oriented Pre-Ranking for Online Advertising [27.28920707332434]
We introduce a consistency-oriented pre-ranking framework for online advertising.
It employs a chunk-based sampling module and a plug-and-play rank alignment module to explicitly optimize consistency of ECPM-ranked results.
When deployed in Taobao display advertising system, it achieves an improvement of up to +12.3% CTR and +5.6% RPM.
arXiv Detail & Related papers (2023-06-06T09:08:40Z)
- An Efficiency Study for SPLADE Models [5.725475501578801]
In this paper, we focus on improving the efficiency of the SPLADE model.
We propose several techniques including L1 regularization for queries, a separation of the document and query encoders, FLOPS-regularized middle-training, and the use of faster query encoders. (A minimal sketch of the query-side L1 regularizer appears after this list.)
arXiv Detail & Related papers (2022-07-08T11:42:05Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization [84.57695474130273]
Gate-based or importance-based pruning methods aim to remove channels whose importance is smallest.
GDP can be plugged in before convolutional layers, without bells and whistles, to control the on-and-off state of each channel.
Experiments conducted on the CIFAR-10 and ImageNet datasets show that the proposed GDP achieves state-of-the-art performance. (A generic gate-plus-polarization sketch appears after this list.)
arXiv Detail & Related papers (2021-09-06T03:17:10Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
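For the FORA entry above: the summary says DiT inference is accelerated by exploiting the repetitive nature of the diffusion process, a pattern commonly realized by caching a block's output and reusing it for several consecutive denoising steps. The wrapper below is a generic sketch of that pattern; the refresh interval and the wrapped block are arbitrary choices, and this is not FORA's exact caching policy.

```python
import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    """Wraps a block and recomputes it only every `interval` steps, reusing the cached output
    in between (a generic reading of step-to-step redundancy in diffusion sampling)."""
    def __init__(self, block, interval=3):
        super().__init__()
        self.block, self.interval = block, interval
        self._cache, self._step = None, 0

    def forward(self, x):
        if self._cache is None or self._step % self.interval == 0:
            self._cache = self.block(x)        # recompute only at refresh steps
        self._step += 1
        return self._cache

# Toy usage: an MLP block reused across consecutive "denoising" steps.
block = CachedBlock(nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)), interval=3)
xs = [torch.randn(2, 16, 64) for _ in range(6)]
outs = [block(x) for x in xs]
print(torch.equal(outs[0], outs[1]))   # True: step 1 reuses the cached step-0 output
print(torch.equal(outs[2], outs[3]))   # False: step 3 is a refresh step and recomputes
```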
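For the ToP entry above: the summary describes distilling token rankings from the final layer of an unpruned model into early layers of a pruned one, so that unimportant tokens can be dropped early. The sketch below shows a generic version of the two ingredients, top-k token pruning plus a soft listwise ranking-distillation loss; the KL-based loss, the random scores, and the keep ratio are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prune_tokens(hidden, scores, keep_ratio=0.5):
    """Keep the top-k highest-scoring tokens per sequence (hidden: [B, T, D], scores: [B, T])."""
    k = max(1, int(hidden.size(1) * keep_ratio))
    topk = scores.topk(k, dim=1).indices.sort(dim=1).values          # preserve token order
    idx = topk.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
    return hidden.gather(1, idx)

def ranking_distillation_loss(student_scores, teacher_scores, tau=1.0):
    """Soft listwise distillation: match the student's token distribution to the teacher's."""
    t = F.softmax(teacher_scores / tau, dim=-1)
    s = F.log_softmax(student_scores / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")

# Toy usage: a teacher ranking from the last layer guides an early-layer student scorer.
hidden = torch.randn(2, 16, 64)                # [batch, tokens, dim]
teacher_scores = torch.randn(2, 16)            # e.g. attention-derived importance (assumed)
student_scores = torch.randn(2, 16, requires_grad=True)
loss = ranking_distillation_loss(student_scores, teacher_scores)
loss.backward()
pruned = prune_tokens(hidden, student_scores.detach(), keep_ratio=0.5)
print(pruned.shape)                            # torch.Size([2, 8, 64])
```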
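For the SPLADE efficiency entry above: one of the listed techniques is L1 regularization on query representations, which drives query term weights toward zero so fewer postings need to be scored at retrieval time. The loss below is a minimal sketch of that regularizer; the pairwise ranking surrogate and the L1 weight are placeholders, not the paper's training objective.

```python
import torch
import torch.nn.functional as F

def splade_style_loss(query_rep, pos_doc_rep, neg_doc_rep, l1_weight=1e-3):
    """Ranking loss plus an L1 penalty that pushes query term weights toward zero.

    query_rep / doc reps: [batch, vocab] non-negative sparse lexical vectors.
    The L1 term is the efficiency lever described in the entry; the rest is a placeholder objective.
    """
    pos = (query_rep * pos_doc_rep).sum(dim=-1)
    neg = (query_rep * neg_doc_rep).sum(dim=-1)
    rank_loss = F.softplus(neg - pos).mean()          # simple pairwise ranking surrogate
    l1_penalty = query_rep.abs().sum(dim=-1).mean()   # sparser queries => fewer postings scored
    return rank_loss + l1_weight * l1_penalty

# Toy usage with random "expanded" lexical vectors over a BERT-sized vocabulary.
q = torch.rand(4, 30522)
d_pos = torch.rand(4, 30522)
d_neg = torch.rand(4, 30522)
print(splade_style_loss(q, d_pos, d_neg).item())
```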
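For the GDP entry above: the summary says learnable gates are placed in front of convolutional layers to switch channels on and off, with training pushing gate values toward 0 or 1 so that near-zero channels can be removed. The module below is a generic gate-plus-polarization sketch under that reading; the g*(1-g) penalty is an assumed stand-in, not GDP's exact differentiable polarization function.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Learnable per-channel gate placed before a conv layer; gates near zero mark prunable channels."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(channels))

    def forward(self, x):                      # x: [N, C, H, W]
        return x * self.gate.view(1, -1, 1, 1)

    def polarization(self):
        """Generic polarization penalty: smallest when each gate is exactly 0 or 1 (assumed form)."""
        g = self.gate.clamp(0.0, 1.0)
        return (g * (1.0 - g)).sum()

# Toy usage: gate -> conv, with the polarization penalty added to the task loss.
gate = ChannelGate(16)
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
x = torch.randn(2, 16, 8, 8)
out = conv(gate(x))
loss = out.pow(2).mean() + 1e-2 * gate.polarization()
loss.backward()
kept = (gate.gate.detach() > 0.05).sum().item()
print("channels kept:", kept, "of", gate.gate.numel())
```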