FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency
Trade-off in Language Model Inference
- URL: http://arxiv.org/abs/2401.04044v1
- Date: Mon, 8 Jan 2024 17:29:16 GMT
- Title: FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency
Trade-off in Language Model Inference
- Authors: Zirui Liu, Qingquan Song, Qiang Charles Xiao, Sathiya Keerthi
Selvaraj, Rahul Mazumder, Aman Gupta, and Xia Hu
- Abstract summary: In practice, our method can reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall clock time speedup on different hardware with negligible accuracy drop.
- Score: 57.119047493787185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The large number of parameters in Pretrained Language Models enhances
their performance, but also makes them resource-intensive, making it challenging
to deploy them on commodity hardware like a single GPU. Due to the memory and
power limitations of these devices, model compression techniques are often used
to decrease both the model's size and its inference latency. This usually
results in a trade-off between model accuracy and efficiency. Therefore,
optimizing this balance is essential for effectively deploying LLMs on
commodity hardware. A significant portion of the efficiency challenge lies in the
Feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$
of the total parameters and inference latency. In this paper, we first observe
that only a few neurons of the FFN module have a large output norm for any input
token, a.k.a. heavy hitters, while the others are sparsely triggered by
different tokens. Based on this observation, we explicitly split the FFN into
two parts according to the heavy hitters. We improve the efficiency-accuracy
trade-off of existing compression methods by allocating more resources to the
FFN part containing the heavy hitters. In practice, our method can reduce model
size by 43.1\% and bring a $1.25\sim1.56\times$ wall-clock time speedup on
different hardware with negligible accuracy drop.
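The heavy-hitter split described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, illustrative PyTorch version of the idea rather than the authors' released implementation: it scores each FFN neuron by the average magnitude of its output contribution over a small calibration set, then partitions the up- and down-projection weights into a small "heavy" sub-network and a larger "light" one. The names (ffn_up, ffn_down, calib_tokens, num_heavy) and the GELU activation are assumptions made for illustration.

```python
# Minimal, illustrative sketch of splitting an FFN by heavy-hitter neurons.
# Names (ffn_up, ffn_down, calib_tokens, num_heavy) and the GELU activation
# are assumptions; this is not the authors' implementation.
import torch
import torch.nn.functional as F

@torch.no_grad()
def split_ffn_by_heavy_hitters(ffn_up: torch.Tensor,        # (d_model, d_ff) up projection
                               ffn_down: torch.Tensor,      # (d_ff, d_model) down projection
                               calib_tokens: torch.Tensor,  # (n_tokens, d_model) calibration inputs
                               num_heavy: int):
    # Neuron i contributes act(x @ ffn_up[:, i]) * ffn_down[i, :] to the FFN output,
    # so score each neuron by its average activation magnitude times the norm of
    # its output row.
    hidden = F.gelu(calib_tokens @ ffn_up)                           # (n_tokens, d_ff)
    neuron_scores = hidden.abs().mean(dim=0) * ffn_down.norm(dim=1)  # (d_ff,)

    heavy_idx = torch.topk(neuron_scores, num_heavy).indices
    light_mask = torch.ones(ffn_up.shape[1], dtype=torch.bool)
    light_mask[heavy_idx] = False
    light_idx = light_mask.nonzero(as_tuple=True)[0]

    # Two sub-FFNs whose summed outputs reproduce the original FFN output;
    # compression can then spend more bits on the small heavy part.
    heavy_part = (ffn_up[:, heavy_idx], ffn_down[heavy_idx, :])
    light_part = (ffn_up[:, light_idx], ffn_down[light_idx, :])
    return heavy_part, light_part
```

Because the two sub-networks together contain exactly the original neurons, the split itself is lossless; the accuracy-efficiency gain then comes from compressing the light part more aggressively than the heavy part.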
Related papers
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points [10.238677144792279]
decoupleQ abandons the traditional quantization paradigm and decouples the model parameters into integer and floating-point parts.
Our method achieves online accuracy near fp16/bf16 on the 2-bit quantization of large speech models at ByteDance.
arXiv Detail & Related papers (2024-04-19T10:02:53Z) - LookupFFN: Making Transformers Compute-lite for CPU inference [23.61144705380663]
GPU clusters are the de facto choice for training large deep neural network (DNN) models today.
Several reasons, including ease of workflow, security, and cost, have led to efforts investigating whether CPUs may be viable for inference in routine use across many sectors of industry.
We study GEMM-based Feed-Forward Networks (FFNs), a workhorse module within modern architectures, and assess the extent to which they can be made compute- (or FLOP-) lite.
arXiv Detail & Related papers (2024-03-12T00:26:16Z) - From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models.
Our proposed method, DEFT, can consistently reduce activation density by up to 44.94\% on RoBERTa$_\mathrm{Large}$ and by 53.19\% (encoder density) and 90.60\% (decoder density) on Flan-T5$_\mathrm{XXL}$.
arXiv Detail & Related papers (2024-02-02T21:25:46Z) - SPT: Fine-Tuning Transformer-based Language Models Efficiently with
Sparsification [14.559316921646356]
Fine-tuning Transformer-based models for downstream tasks has long running time and high memory consumption.
We propose the SPT system to fine-tune Transformer-based models efficiently by introducing sparsity.
SPT consistently outperforms well-optimized baselines, reducing the peak memory consumption by up to 50% and accelerating fine-tuning by up to 2.2x.
arXiv Detail & Related papers (2023-12-16T07:44:52Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format (a minimal sketch of this decomposition appears after this list).
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Cheaply Evaluating Inference Efficiency Metrics for Autoregressive
Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z) - Efficient NLP Inference at the Edge via Elastic Pipelining [0.42970700836450487]
WRX reconciles the latency/memory tension via two novel techniques.
We build WRX and evaluate it against a range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU.
arXiv Detail & Related papers (2022-07-11T17:15:57Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
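For the Dense-and-Sparse decomposition referenced in the SqueezeLLM entry above, the following is a minimal sketch of the general idea under stated assumptions: the largest-magnitude weights are kept in a full-precision sparse matrix, while the remaining dense matrix is what a low-bit quantizer would operate on. The outlier_fraction parameter and the selection rule are illustrative, not SqueezeLLM's exact method.

```python
# Minimal sketch of a dense-and-sparse weight decomposition (illustrative only;
# outlier_fraction and the magnitude-based selection rule are assumptions, not
# SqueezeLLM's exact method, which also accounts for weight sensitivity).
import torch

@torch.no_grad()
def dense_sparse_decompose(weight: torch.Tensor, outlier_fraction: float = 0.005):
    flat = weight.abs().flatten()
    k = max(1, int(outlier_fraction * flat.numel()))
    threshold = torch.topk(flat, k).values.min()

    outlier_mask = weight.abs() >= threshold
    sparse_part = (weight * outlier_mask).to_sparse()  # outliers kept in full precision
    dense_part = weight * (~outlier_mask)              # hand this part to a low-bit quantizer
    return dense_part, sparse_part

# At inference the layer output is W_dense_quantized @ x + W_sparse @ x, which
# recovers the original product up to the quantization error on the dense part.
```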