Cheaply Evaluating Inference Efficiency Metrics for Autoregressive
Transformer APIs
- URL: http://arxiv.org/abs/2305.02440v1
- Date: Wed, 3 May 2023 21:51:42 GMT
- Title: Cheaply Evaluating Inference Efficiency Metrics for Autoregressive
Transformer APIs
- Authors: Deepak Narayanan, Keshav Santhanam, Peter Henderson, Rishi Bommasani,
Tony Lee, Percy Liang
- Abstract summary: Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
- Score: 66.30706841821123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) power many state-of-the-art systems in natural
language processing. However, these models are extremely computationally
expensive, even at inference time, raising the natural question: when is the
extra cost of deploying a larger model worth the anticipated boost in
capabilities? Better understanding this tradeoff fundamentally could benefit
from an inference efficiency metric that is both (i) easily comparable across
models from different providers, and (ii) representative of the true cost of
running queries in an isolated performance environment. Unfortunately, access
to LLMs today is largely restricted to black-box text generation APIs and raw
runtimes measured through this interface do not satisfy these desiderata: model
providers can apply various software and hardware optimizations orthogonal to
the model, and models served on shared infrastructure are susceptible to
performance contention. To circumvent these problems, we propose a new metric
for comparing inference efficiency across models. This metric puts models on
equal footing as though they were served (i) on uniform hardware and software,
and (ii) without performance contention. We call this metric the
\emph{idealized runtime}, and we propose a methodology to efficiently estimate
this metric for autoregressive Transformer models. We also propose cost-aware
variants that incorporate the number of accelerators needed to serve the model.
Using these metrics, we compare ten state-of-the-art LLMs to provide the first
analysis of inference efficiency-capability tradeoffs; we make several
observations from this analysis, including the fact that the superior inference
runtime performance of certain APIs is often a byproduct of optimizations
within the API rather than the underlying model. Our methodology also
facilitates the efficient comparison of different software and hardware stacks.
Related papers
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z) - AXOLOTL: Fairness through Assisted Self-Debiasing of Large Language
Model Outputs [20.772266479533776]
AXOLOTL is a novel post-processing framework that operates agnostically across tasks and models.
It identifies biases, proposes resolutions, and guides the model to self-debias its outputs.
This approach minimizes computational costs and preserves model performance.
arXiv Detail & Related papers (2024-03-01T00:02:37Z) - Leveraging Reinforcement Learning and Large Language Models for Code
Optimization [14.602997316032706]
This paper introduces a new framework to decrease the complexity of code optimization.
The proposed framework builds on large language models (LLMs) and reinforcement learning (RL)
We run several experiments on the PIE dataset using a CodeT5 language model and RRHF, a new reinforcement learning algorithm.
arXiv Detail & Related papers (2023-12-09T19:50:23Z) - Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large
Language Models for Dynamic Inference [32.62084449979531]
We extend SortedNet to generative NLP tasks by replacing Standard Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT)
Our approach boosts model efficiency, eliminating the need for multiple models for various scenarios during inference.
Our results show the superior performance of sub-models in comparison to Standard Fine-Tuning and SFT+ICT (Early-Exit)
arXiv Detail & Related papers (2023-09-16T11:58:34Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - An Efficiency Study for SPLADE Models [5.725475501578801]
In this paper, we focus on improving the efficiency of the SPLADE model.
We propose several techniques including L1 regularization for queries, a separation of document/ encoders, a FLOPS-regularized middle-training, and the use of faster query encoders.
arXiv Detail & Related papers (2022-07-08T11:42:05Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z) - HULK: An Energy Efficiency Benchmark Platform for Responsible Natural
Language Processing [76.38975568873765]
We introduce HULK, a multi-task energy efficiency benchmarking platform for responsible natural language processing.
We compare pretrained models' energy efficiency from the perspectives of time and cost.
arXiv Detail & Related papers (2020-02-14T01:04:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.