LiteTransformerSearch: Training-free On-device Search for Efficient
Autoregressive Language Models
- URL: http://arxiv.org/abs/2203.02094v1
- Date: Fri, 4 Mar 2022 02:10:43 GMT
- Title: LiteTransformerSearch: Training-free On-device Search for Efficient
Autoregressive Language Models
- Authors: Mojan Javaheripi, Shital Shah, Subhabrata Mukherjee, Tomasz L. Religa,
Caio C. T. Mendes, Gustavo H. de Rosa, Sebastien Bubeck, Farinaz Koushanfar,
Debadeepta Dey
- Abstract summary: We show that the latency-perplexity Pareto frontier can be found without the need for any model training.
We evaluate our method, dubbed Lightweight Transformer Search (LTS), on diverse devices.
We show that the perplexity of Transformer-XL can be achieved with up to 2x lower latency.
- Score: 34.673688610935876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The transformer architecture is ubiquitously used as the building block of
most large-scale language models. However, it remains a painstaking guessing
game of trial and error to set its myriad architectural hyperparameters, e.g.,
the number of layers, the number of attention heads, and the inner size of the
feed-forward network, and to find architectures with the optimal trade-off
between task performance, such as perplexity, and compute constraints, such as
memory and latency.
This challenge is further exacerbated by the proliferation of various hardware.
In this work, we leverage the somewhat surprising empirical observation that
the number of non-embedding parameters in autoregressive transformers has a
high rank correlation with task performance, irrespective of the architectural
hyperparameters. Since architectural hyperparameters affect the latency and
memory footprint in a hardware-dependent manner, the above observation
organically induces a simple search algorithm that can be directly run on
target devices. We rigorously show that the latency-perplexity Pareto frontier
can be found without the need for any model training, using non-embedding
parameters as a proxy for perplexity. We evaluate our method,
dubbed Lightweight Transformer Search (LTS), on diverse devices from ARM CPUs
to Nvidia GPUs and show that the perplexity of Transformer-XL can be achieved
with up to 2x lower latency. LTS extracts the Pareto frontier in less than 3
hours while running on a commodity laptop. We effectively remove the carbon
footprint of hundreds of GPU hours of training, offering a strong, simple
baseline for future NAS methods in autoregressive language modeling.
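The core recipe (count non-embedding decoder parameters as a training-free proxy for perplexity, measure latency directly on the target device, and keep the non-dominated configurations) is simple enough to sketch. The Python snippet below is a minimal illustration, not the paper's implementation: the search space, the helper names count_nonemb_params and measure_latency_ms, and the random sampling are hypothetical stand-ins, and a real run would time actual forward passes on the target hardware rather than use the stub shown here.

```python
import itertools
import random
import time

# Hypothetical search space -- an illustrative stand-in, not the paper's space.
SEARCH_SPACE = {
    "n_layer": [2, 4, 8, 16],
    "n_head": [2, 4, 8],
    "d_model": [256, 512, 1024],
    "d_inner": [512, 1024, 2048, 4096],
}


def count_nonemb_params(cfg):
    # Rough decoder parameter count, excluding embeddings (the training-free
    # perplexity proxy). Biases and LayerNorm terms are omitted for brevity;
    # the head count does not change the projection sizes.
    d, d_inner, n_layer = cfg["d_model"], cfg["d_inner"], cfg["n_layer"]
    attn = 4 * d * d        # Q, K, V and output projections
    ffn = 2 * d * d_inner   # two feed-forward matrices
    return n_layer * (attn + ffn)


def measure_latency_ms(cfg):
    # Stub: a real run would build the model for `cfg` and time a few forward
    # passes on the target device (ARM CPU, GPU, ...).
    start = time.perf_counter()
    return (time.perf_counter() - start) * 1e3


def pareto_frontier(points):
    # Keep (latency, proxy, cfg) tuples that no other point beats on both
    # lower latency and higher parameter-count proxy.
    return [pt for pt in points
            if not any(o[0] <= pt[0] and o[1] >= pt[1] and o[:2] != pt[:2]
                       for o in points)]


candidates = [dict(zip(SEARCH_SPACE, values))
              for values in itertools.product(*SEARCH_SPACE.values())]
scored = [(measure_latency_ms(cfg), count_nonemb_params(cfg), cfg)
          for cfg in random.sample(candidates, k=32)]
for latency, proxy, cfg in sorted(pareto_frontier(scored), key=lambda t: t[0]):
    print(f"{latency:.3f} ms | {proxy:,} non-embedding params | {cfg}")
```

Because parameter counting is free, the only on-device work is the latency measurement, which is why a search of this kind can run end-to-end on commodity hardware in a few hours.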
Related papers
- Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
- Multi-objective Differentiable Neural Architecture Search [58.67218773054753]
We propose a novel NAS algorithm that encodes user preferences for the trade-off between performance and hardware metrics.
Our method outperforms existing MOO NAS methods across a broad range of qualitatively different search spaces and datasets.
arXiv Detail & Related papers (2024-02-28T10:09:04Z)
- LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units [5.830814457423021]
Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability.
We show how architectural modifications to a recurrent model can help push its performance toward Transformer models.
We present a spiking version of this architecture, which introduces the benefit of states within the patch embedding and channel mixer modules.
arXiv Detail & Related papers (2024-01-20T01:10:18Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M³ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that enables maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z)
- Learned Queries for Efficient Local Attention [11.123272845092611]
The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner.
We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
arXiv Detail & Related papers (2021-12-21T18:52:33Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
Applied to the recent transformer-based image recognition model ViT, the approach shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification (a simplified sketch of this long/short attention idea appears after this list).
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models [0.0]
We analyse the shortest possible training time for different configurations of distributed training.
We introduce two new methods, layered gradient accumulation and modular pipeline parallelism, which together cut the shortest training time by half.
arXiv Detail & Related papers (2021-06-04T19:21:49Z)
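As flagged in the Long-Short Transformer entry above, the combination it describes (a dynamically projected long-range attention plus a windowed short-term attention) can be illustrated with a short sketch. The single-head PyTorch snippet below is an assumed, simplified reading of that summary, not the authors' implementation: the weight names and shapes are hypothetical, and the paper's dual normalization, masking details, and O(n) windowed computation are omitted (this toy version builds a full attention matrix for readability).

```python
import torch
import torch.nn.functional as F


def long_short_attention(x, w_q, w_k, w_v, w_p, window=8):
    # x: (n, d) token features; w_q/w_k/w_v: (d, d); w_p: (d, r) dynamic projection.
    n, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Long-range branch: compress keys/values into r "landmarks" using a
    # projection computed dynamically from the keys themselves.
    p = F.softmax(k @ w_p, dim=0)            # (n, r), each column sums to 1
    k_long, v_long = p.T @ k, p.T @ v        # (r, d)

    # Short-term branch: each query may only attend to a local window.
    idx = torch.arange(n)
    local = (idx[None, :] - idx[:, None]).abs() <= window   # (n, n) bool

    # Aggregate both branches in a single softmax over [compressed, local] keys.
    k_all = torch.cat([k_long, k], dim=0)    # (r + n, d)
    v_all = torch.cat([v_long, v], dim=0)
    allow = torch.cat([torch.ones(n, k_long.shape[0], dtype=torch.bool), local], dim=1)
    scores = (q @ k_all.T) / d ** 0.5
    scores = scores.masked_fill(~allow, float("-inf"))
    return F.softmax(scores, dim=-1) @ v_all


# Toy usage with random weights.
n, d, r = 64, 32, 4
out = long_short_attention(torch.randn(n, d), torch.randn(d, d),
                           torch.randn(d, d), torch.randn(d, d), torch.randn(d, r))
print(out.shape)  # torch.Size([64, 32])
```

Keeping only the compressed landmarks and the local window is what gives the real method its linear scaling; the dense mask here is purely for illustration.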