Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
- URL: http://arxiv.org/abs/2311.08263v2
- Date: Tue, 4 Jun 2024 03:10:34 GMT
- Title: Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
- Authors: Hongxuan Zhang, Zhining Liu, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen,
- Abstract summary: FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
- Score: 61.83949316226113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose FastCoT, a model-agnostic framework based on parallel decoding without any further training of an auxiliary model or modification to the LLM itself. FastCoT uses a size-varying context window whose size changes with position to conduct parallel decoding and auto-regressive decoding simultaneously, thus fully utilizing GPU computation resources. In FastCoT, the parallel decoding part provides the LLM with a quick glance of the future composed of approximate tokens, which could lead to faster answers compared to regular autoregressive decoding used by causal transformers. We also provide an implementation of parallel decoding within LLM, which supports KV-cache generation and batch processing. Through extensive experiments, we demonstrate that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach. Additionally, we show that the context window size exhibits considerable robustness for different tasks.
Related papers
- Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption [66.97998742151918]
Large Language Models (LLMs) have revolutionized various industries with their advanced language comprehension.
However, their efficiency is challenged by the Transformer architecture's struggle with handling long texts.
KV Cache has emerged as a pivotal solution, converting the time complexity of token generation from quadratic to linear.
arXiv Detail & Related papers (2024-07-25T12:56:22Z) - Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
underlinetextbfPositional textbfIntegrity textbfEncoding (PIE)
PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
arXiv Detail & Related papers (2024-07-03T14:34:03Z) - UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.
We propose a novel parallel prompt decoding that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to 2.49$times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding [15.723047976314751]
Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following.
We propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding.
arXiv Detail & Related papers (2024-02-26T18:59:28Z) - Break the Sequential Dependency of LLM Inference Using Lookahead
Decoding [27.87483106859749]
Lookahead decoding is an exact, parallel decoding algorithm for large language models (LLMs)
Our implementation can speed up autoregressive decoding by up to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code completion tasks.
arXiv Detail & Related papers (2024-02-03T06:37:50Z) - Accelerating Transformer Inference for Translation via Parallel Decoding [2.89306442817912]
Autoregressive decoding limits the efficiency of transformers for Machine Translation (MT)
We present three parallel decoding algorithms and test them on different languages and models.
arXiv Detail & Related papers (2023-05-17T17:57:34Z) - Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator to speed up Large Language Model (LLM) inference with references.
It is motivated by the observation that there are abundant identical text spans between the decoding result by an LLM and the reference that is available in many real world scenarios.
arXiv Detail & Related papers (2023-04-10T09:55:14Z) - Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with
Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than that of the na"ive MD model on GPU and CPU with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.