AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models
- URL: http://arxiv.org/abs/2010.03688v2
- Date: Fri, 10 Jun 2022 01:02:42 GMT
- Title: AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models
- Authors: Amrit Nagarajan, Sanchari Sen, Jacob R. Stevens, Anand Raghunathan
- Abstract summary: AxFormer is a framework that applies accuracy-driven approximations to create optimized transformer models for a given downstream task.
Our experiments show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned models.
- Score: 4.247712017691596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have greatly advanced the state-of-the-art in Natural Language
Processing (NLP) in recent years, but present very large computation and
storage requirements. We observe that the design process of Transformers
(pre-train a foundation model on a large dataset in a self-supervised manner,
and subsequently fine-tune it for different downstream tasks) leads to
task-specific models that are highly over-parameterized, adversely impacting
both accuracy and inference efficiency. We propose AxFormer, a systematic
framework that applies accuracy-driven approximations to create optimized
transformer models for a given downstream task. AxFormer combines two key
optimizations -- accuracy-driven pruning and selective hard attention.
Accuracy-driven pruning identifies and removes parts of the fine-tuned
transformer that hinder performance on the given downstream task. Selective
hard attention optimizes attention blocks in selected layers by eliminating
irrelevant word aggregations, thereby helping the model focus only on the
relevant parts of the input. In effect, AxFormer leads to models that are more
accurate, while also being faster and smaller. Our experiments on GLUE and
SQuAD tasks show that AxFormer models are up to 4.5% more accurate, while also
being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned
models. In addition, we demonstrate that AxFormer can be combined with previous
efforts such as distillation or quantization to achieve further efficiency
gains.
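This listing carries no code, but the two optimizations are concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration based only on the abstract: the `model.blocks` container, the `evaluate` helper, and the top-k rule for hard attention are all assumptions, not the authors' implementation.

```python
import copy
import torch

def accuracy_driven_pruning(model, evaluate):
    """Greedy block removal guided by downstream validation accuracy.
    `model.blocks` (an nn.ModuleList of transformer blocks) and
    `evaluate(model) -> float` are hypothetical stand-ins."""
    best_acc = evaluate(model)
    improved = True
    while improved:
        improved = False
        for i in range(len(model.blocks)):
            candidate = copy.deepcopy(model)
            del candidate.blocks[i]            # tentatively remove one block
            acc = evaluate(candidate)
            if acc > best_acc:                 # keep the removal only if accuracy improves
                model, best_acc, improved = candidate, acc, True
                break
    return model

def selective_hard_attention(scores, k):
    """Keep only the k largest attention logits per query and renormalize,
    so each token aggregates just its most relevant context (one plausible
    reading of 'eliminating irrelevant word aggregations')."""
    topk = scores.topk(k, dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk, 0.0)
    return torch.softmax(scores + mask, dim=-1)
```

The greedy loop accepts a block removal only when validation accuracy improves, which is how pruning can make the model smaller, faster, and more accurate at once.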
Related papers
- Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performances among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks, including CNN/DailyMail and ELI5, our model outperforms the standard BART by inheriting the BART weights.
arXiv Detail & Related papers (2023-05-24T12:33:06Z)
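A minimal sketch of the core idea as this summary reads (the paper's exact operator may differ): treat redundancy removal as low-pass filtering, shortening the sequence by keeping only the lowest DFT frequencies along the time axis.

```python
import torch

def fft_downsample(hidden, keep_ratio=0.5):
    """Shorten a sequence by low-pass filtering in the frequency domain.
    hidden: (batch, seq_len, d_model). A sketch only, not the Fourier
    Transformer's actual operator."""
    seq_len = hidden.shape[1]
    new_len = max(1, int(seq_len * keep_ratio))
    freq = torch.fft.rfft(hidden, dim=1)      # spectrum over the time axis
    kept = freq[:, : new_len // 2 + 1, :]     # keep low frequencies only
    return torch.fft.irfft(kept, n=new_len, dim=1)
```

Downstream attention layers would then run on the shorter sequence, which is where the computational savings come from.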
- Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z)
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time [69.7693300927423]
We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations improves accuracy and robustness.
We show that the model soup approach extends to multiple image classification and natural language processing tasks.
arXiv Detail & Related papers (2022-03-10T17:03:49Z)
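The recipe is simple enough to sketch directly; this "uniform soup" variant (a generic PyTorch sketch, not the authors' code) just averages the parameter tensors of checkpoints fine-tuned from the same pre-trained model.

```python
import torch

def uniform_soup(state_dicts):
    """Average parameter tensors across checkpoints fine-tuned from the
    same pre-trained model (all must share one architecture)."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# e.g. model.load_state_dict(uniform_soup([torch.load(p) for p in ckpt_paths]))
```

The paper's greedy variant instead adds a checkpoint to the soup only when held-out accuracy improves.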
- Scatterbrain: Unifying Sparse and Low-rank Attention Approximation [25.375024028636663]
We propose Scatterbrain, a novel way to unify sparse (via locality sensitive hashing) and low-rank (via kernel feature map) attention for accurate approximation.
We empirically show that Scatterbrain can achieve 2.1x lower error than baselines when serving as a drop-in replacement in BigGAN image generation and pre-trained T2T-ViT.
We demonstrate Scatterbrain for end-to-end training with up to 4 points better perplexity and 5 points better average accuracy than sparse or low-rank efficient transformers on language modeling and long-range-arena tasks.
arXiv Detail & Related papers (2021-10-28T17:52:17Z)
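A simplified sketch of the sparse-plus-low-rank decomposition (assumptions: Performer-style random features for the low-rank part, and a fixed local window standing in for the LSH-selected sparse support; full matrices are materialized only for clarity).

```python
import torch

def softmax_features(x, w):
    """Performer-style positive random features: phi(q) @ phi(k) is an
    unbiased estimate of exp(q @ k / sqrt(d)). w: (r, d), rows ~ N(0, I)."""
    x = x / x.shape[-1] ** 0.25
    return torch.exp(x @ w.T - x.pow(2).sum(-1, keepdim=True) / 2) / w.shape[0] ** 0.5

def scatterbrain_attention(q, k, v, w, window=4):
    """Unify low-rank (kernel feature) and sparse (exact) attention. The
    sparse support should come from LSH as in the paper; a local window
    stands in here. The point of the real method is to never materialize
    these full matrices."""
    fq, fk = softmax_features(q, w), softmax_features(k, w)
    lowrank = fq @ fk.T                                   # low-rank estimate of exp-scores
    exact = torch.exp(q @ k.T / q.shape[-1] ** 0.5)       # true exp-scores
    idx = torch.arange(q.shape[0])
    near = (idx[:, None] - idx[None, :]).abs() <= window  # sparse support
    scores = torch.where(near, exact, lowrank)            # exact values on the support
    return (scores / scores.sum(-1, keepdim=True)) @ v
```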
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose transforming a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
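A toy sketch of the transformation: split a fine-tuned FFN's intermediate neurons into expert groups and activate only the top-k groups per token. The contiguous grouping and the linear router below are placeholders; the paper constructs experts by clustering neurons and trains a proper expert selector.

```python
import torch
import torch.nn as nn

class MoEfiedFFN(nn.Module):
    """Reinterpret a dense FFN as a mixture of experts with equal total
    size. Masking is used here for clarity; real savings require computing
    only the selected experts' neurons."""
    def __init__(self, w_in: nn.Linear, w_out: nn.Linear, n_experts=8, top_k=2):
        super().__init__()
        assert w_in.out_features % n_experts == 0
        self.w_in, self.w_out = w_in, w_out
        self.n_experts, self.top_k = n_experts, top_k
        self.group = w_in.out_features // n_experts
        self.router = nn.Linear(w_in.in_features, n_experts)  # toy expert scorer

    def forward(self, x):                                  # x: (tokens, d_model)
        gate = torch.zeros(x.shape[0], self.n_experts, device=x.device)
        top = self.router(x).topk(self.top_k, dim=-1).indices
        gate.scatter_(1, top, 1.0)                         # 1 for selected experts
        h = torch.relu(self.w_in(x))                       # (tokens, d_ff)
        h = h * gate.repeat_interleave(self.group, dim=1)  # silence other experts
        return self.w_out(h)
```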
- Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it.
Our work provides a new angle for studying model efficiency through an extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z)
- DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that for a small drop of accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z)
- Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We show that eliminating the MASK token and computing the loss over the whole output are essential choices for improving performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
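A minimal sketch of the early-exit pattern the summary describes (the layer/classifier structure, [CLS] pooling, and fixed confidence threshold are assumptions; the paper's exit criterion may be calibrated differently).

```python
import torch

def early_exit_forward(layers, classifiers, x, threshold=0.9):
    """Run layers one at a time and stop as soon as a per-layer classifier
    is confident enough. Assumes batch size 1 for simplicity."""
    for layer, clf in zip(layers, classifiers):
        x = layer(x)                                 # (1, seq_len, d_model)
        probs = torch.softmax(clf(x[:, 0]), dim=-1)  # predict from the [CLS] position
        if probs.max() >= threshold:                 # confident: exit early
            return probs
    return probs                                     # hard instance: used every layer
```

Simple inputs exit after a few layers and pay a fraction of the inference cost; hard inputs fall through to the full network.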
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)