LLMCad: Fast and Scalable On-device Large Language Model Inference
- URL: http://arxiv.org/abs/2309.04255v1
- Date: Fri, 8 Sep 2023 10:44:19 GMT
- Title: LLMCad: Fast and Scalable On-device Large Language Model Inference
- Authors: Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu,
and Xuanzhe Liu
- Abstract summary: Generative tasks, such as text generation and question answering, hold a crucial position in the realm of mobile applications.
Currently, the execution of these generative tasks heavily depends on Large Language Models (LLMs).
We introduce LLMCad, an on-device inference engine specifically designed for efficient generative Natural Language Processing (NLP) tasks.
- Score: 11.103824752113148
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative tasks, such as text generation and question answering, hold a
crucial position in the realm of mobile applications. Due to their sensitivity
to privacy concerns, there is a growing demand for their execution directly on
mobile devices. Currently, the execution of these generative tasks heavily
depends on Large Language Models (LLMs). Nevertheless, the limited memory
capacity of these devices presents a formidable challenge to the scalability of
such models.
In our research, we introduce LLMCad, an innovative on-device inference
engine specifically designed for efficient generative Natural Language
Processing (NLP) tasks. The core idea behind LLMCad revolves around model
collaboration: a compact LLM, residing in memory, takes charge of generating
the most straightforward tokens, while a high-precision LLM steps in to
validate these tokens and rectify any identified errors. LLMCad incorporates
three novel techniques: (1) Instead of generating candidate tokens in a
sequential manner, LLMCad employs the smaller LLM to construct a token tree,
encompassing a wider range of plausible token pathways. Subsequently, the
larger LLM can efficiently validate all of these pathways simultaneously. (2)
It employs a self-adjusting fallback strategy, swiftly initiating the
verification process whenever the smaller LLM generates an erroneous token. (3)
To ensure a continuous flow of token generation, LLMCad speculatively generates
tokens during the verification process by implementing a compute-IO pipeline.
Through an extensive series of experiments, LLMCad showcases an impressive
token generation speed, achieving rates up to 9.3x faster than existing
inference engines.
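To make the collaboration concrete, here is a minimal draft-then-verify sketch in Python. It is not LLMCad's implementation: the model interfaces and helper names are hypothetical, the draft is a single chain rather than the token tree described above, and the verifier is invoked once per position instead of scoring all candidates in one batched pass.

```python
from typing import Callable, List

Token = int
# Each model is abstracted as a callable that greedily returns the next token
# id for a given token sequence.
NextToken = Callable[[List[Token]], Token]


def draft_and_verify(prompt: List[Token],
                     small_lm: NextToken,   # compact model kept in memory
                     large_lm: NextToken,   # high-precision verifier
                     draft_len: int = 4,
                     max_new_tokens: int = 64,
                     eos_id: Token = 2) -> List[Token]:
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens and (not tokens or tokens[-1] != eos_id):
        # 1) The small model drafts a short continuation cheaply.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(small_lm(tokens + draft))
            if draft[-1] == eos_id:
                break
        # 2) The large model re-checks each drafted position: the longest
        #    agreeing prefix is accepted, and the first disagreement is
        #    replaced by the large model's own token (the "rectify" step).
        accepted: List[Token] = []
        for i, tok in enumerate(draft):
            verified = large_lm(tokens + draft[:i])
            accepted.append(verified)
            if verified != tok:
                break
        tokens.extend(accepted)
        generated += len(accepted)
    return tokens
```

LLMCad's token-tree variant generalizes the drafted chain into a tree of plausible continuations so the larger model can validate all pathways in a single pass, and its compute-IO pipeline keeps drafting tokens speculatively while verification is still in flight.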
Related papers
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x on several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs).
We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference.
We show that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
arXiv Detail & Related papers (2024-10-09T14:15:30Z)
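As a rough illustration of the layer-skipping idea behind SWIFT-style self-speculation, the sketch below (hypothetical names and skip indices, not SWIFT's implementation) treats the draft model as the same block stack with some intermediate blocks bypassed; the full stack is then reused to verify the drafted tokens.

```python
from typing import Callable, List, Sequence

Hidden = List[float]                  # stand-in for a hidden-state tensor
Block = Callable[[Hidden], Hidden]    # one transformer block


def run_stack(blocks: Sequence[Block], h: Hidden,
              skip: Sequence[int] = ()) -> Hidden:
    """Apply the block stack, bypassing the indices in `skip` (draft mode)."""
    for i, block in enumerate(blocks):
        if i in skip:
            continue                  # skipped block: hidden state passes through
        h = block(h)
    return h

# Draft pass (cheap):  run_stack(blocks, h, skip=(8, 10, 12, 14))
# Verify pass (exact): run_stack(blocks, h) uses every block, so the same
# weights serve as both drafter and verifier -- no separate draft model.
```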
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)
- One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding [11.832919020149891]
This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters.
We propose Smart Parallel Auto-Correct Decoding (SPACE).
arXiv Detail & Related papers (2024-02-19T03:39:10Z)
- SPEED: Speculative Pipelined Execution for Efficient Decoding [35.45955948053644]
We propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token.
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized.
We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy, and show how speculation allows training deeper decoders with parameter sharing at minimal runtime overhead.
arXiv Detail & Related papers (2023-10-18T16:07:01Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
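As a toy sketch of the structural pruning described above, the snippet below scores coupled structures (attention heads) and removes the least important ones whole. A plain weight-magnitude score stands in for LLM-Pruner's actual importance criterion, and all names, shapes, and ratios are illustrative.

```python
import numpy as np


def select_heads_to_keep(head_weights: np.ndarray, keep_ratio: float = 0.75) -> np.ndarray:
    """head_weights: shape (num_heads, head_dim, hidden_dim).
    Returns sorted indices of heads to keep; whole heads are dropped so the
    remaining network stays dense and structurally consistent."""
    importance = np.abs(head_weights).sum(axis=(1, 2))        # L1 mass per head
    num_keep = max(1, int(round(keep_ratio * importance.size)))
    keep = np.argsort(importance)[-num_keep:]                 # highest-scoring heads
    return np.sort(keep)

# Example: keep 75% of 16 heads with random weights.
# kept = select_heads_to_keep(np.random.randn(16, 64, 1024), keep_ratio=0.75)
```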
- Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator that speeds up Large Language Model (LLM) inference using references.
It is motivated by the observation that the decoding result of an LLM often shares abundant identical text spans with a reference that is available in many real-world scenarios.
arXiv Detail & Related papers (2023-04-10T09:55:14Z)
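The span-copying idea behind LLMA can be sketched as follows (a hypothetical helper, not the paper's code): when the most recently generated tokens also occur in a reference document, the tokens that follow that occurrence are proposed as a draft for the LLM to verify in the usual draft-and-check fashion.

```python
from typing import List


def draft_from_reference(output: List[int], reference: List[int],
                         match_len: int = 4, copy_len: int = 8) -> List[int]:
    """Return up to `copy_len` reference tokens that follow a match of the
    last `match_len` output tokens inside `reference`, or [] if none match."""
    if len(output) < match_len:
        return []
    suffix = output[-match_len:]
    for i in range(len(reference) - match_len + 1):
        if reference[i:i + match_len] == suffix:
            return reference[i + match_len:i + match_len + copy_len]
    return []
```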