Characterizing the Efficiency vs. Accuracy Trade-off for Long-Context
NLP Models
- URL: http://arxiv.org/abs/2204.07288v1
- Date: Fri, 15 Apr 2022 01:52:45 GMT
- Title: Characterizing the Efficiency vs. Accuracy Trade-off for Long-Context
NLP Models
- Authors: Phyllis Ang, Bhuwan Dhingra, Lisa Wu Wills
- Abstract summary: We study the accuracy vs. efficiency trade-off on two widely used long-sequence models.
We find that LED consistently achieves better accuracy at lower energy costs than Big Bird.
For question answering, we find that smaller models are both more efficient and more accurate due to the larger training batch sizes possible under a fixed resource budget.
- Score: 12.062489591946457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With many real-world applications of Natural Language Processing (NLP)
comprising long texts, there has been a rise in NLP benchmarks that measure
the accuracy of models that can handle longer input sequences. However, these
benchmarks do not consider the trade-offs between accuracy, speed, and power
consumption as input sizes or model sizes are varied. In this work, we perform
a systematic study of this accuracy vs. efficiency trade-off on two widely used
long-sequence models - Longformer-Encoder-Decoder (LED) and Big Bird - during
fine-tuning and inference on four datasets from the SCROLLS benchmark. To study
how this trade-off differs across hyperparameter settings, we compare the
models across four sequence lengths (1024, 2048, 3072, 4096) and two model
sizes (base and large) under a fixed resource budget. We find that LED
consistently achieves better accuracy at lower energy costs than Big Bird. For
summarization, we find that increasing model size is more energy efficient than
increasing sequence length for higher accuracy. However, this comes at the cost
of a large drop in inference speed. For question answering, we find that
smaller models are both more efficient and more accurate due to the larger
training batch sizes possible under a fixed resource budget.
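To make the measurement setup concrete, the sketch below times a single summarization pass for LED and Big Bird checkpoints at the four input lengths studied, while polling GPU power draw through NVML to get a rough per-example energy estimate. This is an illustrative sketch, not the authors' benchmarking harness: the checkpoint names, the 100 ms polling interval, and the greedy generation settings are assumptions.

```python
# Rough accuracy-vs-efficiency probe, assuming Hugging Face checkpoints for
# LED and Big Bird; an illustrative sketch, not the paper's actual harness.
import threading
import time

import pynvml
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODELS = {
    "LED": "allenai/led-base-16384",
    "BigBird": "google/bigbird-pegasus-large-arxiv",
}
SEQ_LENS = [1024, 2048, 3072, 4096]  # input lengths compared in the paper


class PowerSampler(threading.Thread):
    """Polls GPU power draw (watts) roughly every 100 ms until stopped."""

    def __init__(self, handle):
        super().__init__(daemon=True)
        self.handle, self.samples, self._stop = handle, [], threading.Event()

    def run(self):
        while not self._stop.is_set():
            # NVML reports milliwatts; convert to watts.
            self.samples.append(pynvml.nvmlDeviceGetPowerUsage(self.handle) / 1000.0)
            time.sleep(0.1)

    def stop(self):
        self._stop.set()


def profile(checkpoint, text, max_len, gpu_handle):
    tok = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).cuda().eval()
    batch = tok(text, truncation=True, max_length=max_len, return_tensors="pt").to("cuda")

    sampler = PowerSampler(gpu_handle)
    torch.cuda.synchronize()
    start = time.perf_counter()
    sampler.start()
    with torch.no_grad():
        model.generate(**batch, max_new_tokens=128)
    torch.cuda.synchronize()
    sampler.stop()
    sampler.join()
    elapsed = time.perf_counter() - start
    watts = sum(sampler.samples) / max(len(sampler.samples), 1)
    return elapsed, watts * elapsed  # latency (s), coarse energy estimate (J)


if __name__ == "__main__":
    pynvml.nvmlInit()
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    document = "long input text ..."  # substitute a SCROLLS example here
    for name, ckpt in MODELS.items():
        for n in SEQ_LENS:
            secs, joules = profile(ckpt, document, n, gpu)
            print(f"{name} @ {n} tokens: {secs:.2f} s, ~{joules:.0f} J")
```

The NVML readings are board-level and coarse, so the numbers are only indicative; the point of the comparison is that latency and energy can move differently as sequence length and model size grow.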
Related papers
- Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference [0.0]
We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference.
EAD switches between different-sized models based on prediction uncertainty; a minimal sketch of this idea appears after this list.
We show remarkable efficiency gains across different model families.
arXiv Detail & Related papers (2025-02-05T22:15:21Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
In practice, the method reduces model size by 43.1% and brings a 1.25x-1.56x wall-clock-time speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models [21.17021844323919]
We introduce Astraios, a suite of 28 instruction-tuned OctoCoder models using 7 tuning methods and 4 model sizes up to 16 billion parameters.
We find that full fine-tuning (FFT) leads to the best downstream performance across all scales, while parameter-efficient fine-tuning (PEFT) methods differ significantly in their efficacy depending on model scale.
arXiv Detail & Related papers (2024-01-01T15:30:19Z)
- Understanding the Impact of Post-Training Quantization on Large Language Models [0.38073142980732994]
The study finds that nf4 shows greater resilience to temperature variations for the llama2 series of models at lower temperatures.
Int8 quantization is associated with significantly slower inference speeds, whereas unquantized bfloat16 models consistently yield the fastest inference speeds across models of all sizes.
arXiv Detail & Related papers (2023-09-11T02:58:32Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Efficiently Scaling Transformer Inference [8.196193683641582]
We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings.
We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices.
We achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens.
arXiv Detail & Related papers (2022-11-09T18:50:38Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks thanks to their large parameter capacity, but they also incur a huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models [4.247712017691596]
AxFormer is a framework that applies accuracy-driven approximations to create optimized transformer models for a given downstream task.
Our experiments show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned models.
arXiv Detail & Related papers (2020-10-07T23:29:34Z)
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best BERT model structure for a given computation budget to match specific devices.
Our framework guarantees that the identified model meets both the resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
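As referenced in the Entropy Adaptive Decoding entry above, the sketch below illustrates the general entropy-gated model-switching idea: draft each token with a small model and defer to a larger one whenever the small model's predictive entropy is high. The GPT-2 checkpoints, the greedy decoding loop, and the 3.0-nat threshold are stand-in assumptions, not the paper's exact procedure.

```python
# Minimal sketch of entropy-gated model switching (in the spirit of Entropy
# Adaptive Decoding). Checkpoints and threshold below are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SMALL, LARGE = "gpt2", "gpt2-large"   # stand-in small/large model pair
ENTROPY_THRESHOLD = 3.0               # nats; a tuning knob, not from the paper

tok = AutoTokenizer.from_pretrained(SMALL)  # GPT-2 variants share a tokenizer
small = AutoModelForCausalLM.from_pretrained(SMALL).eval()
large = AutoModelForCausalLM.from_pretrained(LARGE).eval()


@torch.no_grad()
def generate(prompt, max_new_tokens=40):
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Next-token distribution from the small model.
        logits = small(ids).logits[:, -1, :]
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).item()
        if entropy > ENTROPY_THRESHOLD:
            # High uncertainty: re-predict this token with the larger model.
            logits = large(ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)


print(generate("Long-context models trade accuracy for"))
```

The design intuition is that most tokens are easy and can be handled by the cheap model, so the expensive model is only consulted on the uncertain steps.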
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.