Related papers: BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

URL: http://arxiv.org/abs/2401.12522v2
Date: Thu, 25 Jan 2024 14:02:03 GMT
Title: BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
Authors: Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao
Abstract summary: Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and extended latency. We present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification. The proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7$times$ speedup on the MT-Bench benchmark.
Score: 37.09385961422664
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by the concept of prompt tuning, we enhance LLMs with a parameter-efficient design called bi-directional tuning for the capability in semi-autoregressive generation. Employing efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, seamlessly boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7$\times$ speedup on the MT-Bench benchmark. Extensive experiments confirm our method surpasses state-of-the-art acceleration techniques.

Related papers

Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves [123.07450481623124]
We propose Skip Tuning as a novel paradigm for adapting vision-language models to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules.
arXiv Detail & Related papers (2024-12-16T07:33:23Z)
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. We show that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
arXiv Detail & Related papers (2024-10-09T14:15:30Z)
Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference. This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
Optimized Multi-Token Joint Decoding with Auxiliary Model for LLM Inference [41.93955876156331]
Large language models (LLMs) have achieved remarkable success across diverse tasks. Their inference processes are hindered by substantial time and energy demands due to single-token generation at each decoding step. We introduce multi-token assisted decoding (MTAD), a novel framework designed to accelerate MTJD.
arXiv Detail & Related papers (2024-07-12T23:29:54Z)
A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies [51.7643024367548]
Stable Diffusion Model is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation. This study focuses on reducing redundant computation in SDM and optimizing the model through both tuning and tuning-free methods.
arXiv Detail & Related papers (2024-05-31T21:47:05Z)
Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding [2.642212767247493]
We introduce Adaptive N-gram Parallel Decoding (ANPD), which accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD preserves the integrity of the original output while enhancing processing speed. In experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x.
arXiv Detail & Related papers (2024-04-10T16:11:09Z)
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs) The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference. The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference speed.
arXiv Detail & Related papers (2024-03-22T14:20:34Z)
DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing. Existing ultra-low-bit quantization always causes severe accuracy drops. We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z)
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding [11.832919020149891]
This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose textbfSmart textbfParallel textbfAuto-textbfCorrect dtextbfEcoding (SPACE)
arXiv Detail & Related papers (2024-02-19T03:39:10Z)
GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding. GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM. We will release our code, data, and the trained draft models.
arXiv Detail & Related papers (2024-02-03T08:44:11Z)
Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding [25.03122689338891]
We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models. The proposed method requires no additional neural network training and no extra memory footprint. Benchmarks with LLaMA-2 and its variants demonstrated a speedup up to 1.99$times$.
arXiv Detail & Related papers (2023-09-15T05:34:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.