Finding Fast Transformers: One-Shot Neural Architecture Search by
Component Composition
- URL: http://arxiv.org/abs/2008.06808v1
- Date: Sat, 15 Aug 2020 23:12:25 GMT
- Title: Finding Fast Transformers: One-Shot Neural Architecture Search by
Component Composition
- Authors: Henry Tsai, Jayden Ooi, Chun-Sung Ferng, Hyung Won Chung, Jason Riesa
- Abstract summary: Transformer-based models have achieved state-of-the-art results in many tasks in natural language processing.
We develop an efficient algorithm to search for fast models while maintaining model quality.
- Score: 11.6409723227448
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformer-based models have achieved state-of-the-art results in many tasks
in natural language processing. However, such models are usually slow at
inference time, making deployment difficult. In this paper, we develop an
efficient algorithm to search for fast models while maintaining model quality.
We describe a novel approach to decompose the Transformer architecture into
smaller components, and propose a sampling-based one-shot architecture search
method to find an optimal model for inference. The model search process is more
efficient than alternatives, adding only a small overhead to training time. By
applying our methods to BERT-base architectures, we achieve 10% to 30% speedup
for pre-trained BERT and 70% speedup on top of a previous state-of-the-art
distilled BERT model on Cloud TPU-v2 with a generally acceptable drop in
performance.
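The abstract describes the method only at a high level. Below is a minimal, hypothetical Python sketch of the general idea of sampling-based one-shot search over a Transformer decomposed into per-layer components; the component names, latency costs, quality proxy, and tolerance are illustrative assumptions, not the paper's actual search space or implementation.

```python
# Minimal, hypothetical sketch of sampling-based one-shot architecture search
# over decomposed Transformer components. All names, latency costs, the quality
# proxy, and the tolerance are illustrative assumptions, not the paper's setup.
import random

# Per-layer search space: each layer is composed from smaller components,
# e.g. a narrower self-attention block and a smaller feed-forward block.
ATTENTION_CHOICES = ["attn_12head", "attn_8head", "attn_4head"]
FFN_CHOICES = ["ffn_3072", "ffn_1536", "ffn_768"]
NUM_LAYERS = 12

# Illustrative per-component latency costs (arbitrary units); in practice
# these would be measured on the target hardware.
LATENCY = {"attn_12head": 3.0, "attn_8head": 2.2, "attn_4head": 1.4,
           "ffn_3072": 4.0, "ffn_1536": 2.1, "ffn_768": 1.2}

FULL_ARCH = [("attn_12head", "ffn_3072")] * NUM_LAYERS


def sample_architecture():
    """Sample one composition: an (attention, ffn) component choice per layer."""
    return [(random.choice(ATTENTION_CHOICES), random.choice(FFN_CHOICES))
            for _ in range(NUM_LAYERS)]


def latency(arch):
    return sum(LATENCY[attn] + LATENCY[ffn] for attn, ffn in arch)


def quality(arch):
    """Toy stand-in for evaluating a sampled sub-model with shared supernet
    weights on a dev set; here smaller compositions simply score lower."""
    return 1.0 - 0.2 * (1.0 - latency(arch) / latency(FULL_ARCH))


# One-shot training phase (schematic): at each step a random composition is
# sampled and only that composition's shared weights would be updated.
for step in range(1000):
    arch = sample_architecture()
    # train_step(supernet, arch, next_batch())  # weight-sharing update (omitted)

# Search phase: sample candidate compositions and keep the fastest one whose
# estimated quality stays within an acceptable drop from the full model.
ACCEPTABLE_DROP = 0.05
best = None
for _ in range(2000):
    arch = sample_architecture()
    if quality(FULL_ARCH) - quality(arch) <= ACCEPTABLE_DROP:
        if best is None or latency(arch) < latency(best):
            best = arch

if best is not None:
    print("fastest acceptable composition:", best)
    print("latency:", latency(best), "vs full model:", latency(FULL_ARCH))
```

In practice the latency table would be measured on the target hardware (the abstract reports speedups on Cloud TPU-v2), and each sampled composition would be evaluated with the shared supernet weights rather than the toy proxy above.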
Related papers
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and
Performance Optimization [54.113083217869516]
In this work, we first explore the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- AutoDiffusion: Training-Free Optimization of Time Steps and
Architectures for Automated Diffusion Model Acceleration [57.846038404893626]
We propose to search the optimal time steps sequence and compressed model architecture in a unified framework to achieve effective image generation for diffusion models without any further training.
Experimental results show that our method achieves excellent performance by using only a few time steps, e.g. 17.86 FID score on ImageNet 64$\times$64 with only four steps, compared to 138.66 with DDIM.
arXiv Detail & Related papers (2023-09-19T08:57:24Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
- AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z)
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best BERT model structure for a given computation budget to match specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
- Deep-n-Cheap: An Automated Search Framework for Low Complexity Deep
Learning [3.479254848034425]
We present Deep-n-Cheap -- an open-source AutoML framework to search for deep learning models.
Our framework is targeted for deployment on both benchmark and custom datasets.
Deep-n-Cheap includes a user-customizable complexity penalty which trades off performance with training time or number of parameters; a minimal, hypothetical sketch of such a penalty appears after this list.
arXiv Detail & Related papers (2020-03-27T13:00:21Z)
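The complexity-penalty idea summarized for Deep-n-Cheap above can be made concrete with a small sketch; the penalty weight, candidate models, and scoring function below are hypothetical illustrations, not the framework's actual API.

```python
# Hypothetical sketch of a complexity-penalized model-selection objective:
# accuracy is traded off against parameter count via a user-chosen weight.
def penalized_score(accuracy, num_params, penalty=1e-9):
    """Higher is better; a larger penalty favors smaller models."""
    return accuracy - penalty * num_params

# Illustrative candidates: (name, dev-set accuracy, parameter count).
candidates = [("large", 0.91, 340_000_000),
              ("base", 0.89, 110_000_000),
              ("small", 0.86, 30_000_000)]

best = max(candidates, key=lambda c: penalized_score(c[1], c[2]))
print("selected:", best[0])  # with penalty=1e-9 the smallest model wins
```

Decreasing the penalty shifts the choice toward larger, more accurate models; the same scheme could penalize measured training time instead of parameter count.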
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.