FIRP: Faster LLM inference via future intermediate representation prediction
- URL: http://arxiv.org/abs/2410.20488v1
- Date: Sun, 27 Oct 2024 15:53:49 GMT
- Title: FIRP: Faster LLM inference via future intermediate representation prediction
- Authors: Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao
- Abstract summary: FIRP generates multiple tokens instead of one at each decoding step. We conduct extensive experiments, showing a speedup ratio of 1.9x-3x across several models and datasets.
- Score: 54.897493351694195
- Abstract: Recent advancements in Large Language Models (LLMs) have shown remarkable performance across a wide range of tasks. Despite this, the auto-regressive nature of LLM decoding, which generates only a single token per forward propagation, fails to fully exploit the parallel computational power of GPUs, leading to considerable latency. To address this, we introduce a novel speculative decoding method named FIRP, which generates multiple tokens instead of one at each decoding step. We achieve this by predicting the intermediate hidden states of future tokens (tokens that have not yet been decoded) and then using these pseudo hidden states to decode the future tokens. Specifically, the pseudo hidden states are predicted with a simple linear transformation in the intermediate layers of the LLM. Once predicted, they participate in the computation of all the following layers, thereby assimilating richer semantic information. As the layers go deeper, the semantic gap between pseudo and real hidden states narrows, making it feasible to decode future tokens with high accuracy. To validate the effectiveness of FIRP, we conduct extensive experiments, showing a speedup ratio of 1.9x-3x across several models and datasets; analytical experiments further support our motivation.
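As a rough, self-contained illustration of the mechanism described above (not the authors' implementation), the PyTorch sketch below uses a linear map at one intermediate layer to predict a pseudo hidden state for a single future token and lets that state participate in all subsequent layers; the toy model size, the layer index, and the one-token horizon are all illustrative assumptions.

```python
# Toy sketch of the idea in the abstract (illustrative only, not FIRP's code):
# at an intermediate layer, a learned linear map predicts a "pseudo" hidden state
# for the next, not-yet-decoded position; the pseudo state then runs through the
# remaining layers so an extra token can be read out in the same forward pass.
import torch
import torch.nn as nn


class ToyFutureStateDecoder(nn.Module):
    def __init__(self, d_model=64, n_layers=8, vocab=1000, predict_at=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.predict_at = predict_at                    # illustrative intermediate layer index
        self.future_proj = nn.Linear(d_model, d_model)  # simple linear transformation -> pseudo state
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, h):                               # h: (batch, seq, d_model) embedded tokens
        for i, layer in enumerate(self.layers):
            h = layer(h)
            if i == self.predict_at:
                pseudo = self.future_proj(h[:, -1:, :])   # predict the next position's hidden state
                h = torch.cat([h, pseudo], dim=1)         # pseudo state joins all following layers
        return self.lm_head(h[:, -2:, :])                 # logits for the current and one future token


model = ToyFutureStateDecoder()
embedded_prompt = torch.randn(1, 10, 64)                 # stand-in for embedded prompt tokens
print(model(embedded_prompt).shape)                      # torch.Size([1, 2, 1000])
```

Since FIRP is a speculative decoding method, tokens drafted from pseudo states like these would still be checked against the model's ordinary auto-regressive output before being kept.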
Related papers
- Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference (2024-05-09) [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.
Our approach is inspired by two intriguing phenomena we have observed.
Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
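The withdrawal idea can be pictured with a toy forward pass that simply drops the visual-token positions once a chosen layer is reached; the layer index, token split, and toy layers below are made-up values for illustration rather than VTW's actual policy.

```python
# Illustration only: after a chosen layer, visual tokens leave the sequence,
# so all deeper layers attend over text tokens alone (numbers are invented).
import torch
import torch.nn as nn

layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True) for _ in range(6)
)
withdraw_at, n_visual = 3, 20                 # hypothetical withdrawal layer and visual-token count

h = torch.randn(1, n_visual + 8, 32)          # 20 visual tokens followed by 8 text tokens
for i, layer in enumerate(layers):
    if i == withdraw_at:
        h = h[:, n_visual:, :]                # visual tokens withdraw here
    h = layer(h)
print(h.shape)                                # torch.Size([1, 8, 32])
```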
- Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models (2024-04-23) [9.121458241884444]
Speculative execution is introduced to LLM decoding in a draft-then-verify style.
As the costly inference is parallelized, decoding speed can be significantly boosted.
We present the first survey paper that reviews and unifies the literature on speculative execution in LLMs.
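The draft-then-verify pattern the survey covers can be sketched in a few lines: a small draft model proposes k tokens one by one, the target model scores all of them in a single parallel pass, and the longest agreeing prefix is accepted. The greedy acceptance rule and the HuggingFace-style `.logits` interface below are simplifying assumptions, not any particular system from the survey.

```python
# Generic greedy draft-then-verify step (a simplified sketch; real systems also
# support sampling-based verification and tree-structured drafts).
import torch


@torch.no_grad()
def draft_then_verify_step(target_model, draft_model, input_ids, k=4):
    """Propose k tokens with the small model, verify them with one pass of the large model."""
    ids = input_ids
    for _ in range(k):                                     # cheap sequential drafting
        next_id = draft_model(ids).logits[:, -1, :].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    n = input_ids.shape[1]
    draft_tokens = ids[:, n:]                              # the k proposed tokens

    target_logits = target_model(ids).logits               # one parallel pass scores every position
    target_preds = target_logits[:, n - 1:-1, :].argmax(-1)

    # Accept the longest prefix where draft and target agree, plus one "free"
    # token from the target model; with greedy decoding the output is unchanged.
    agree = (target_preds == draft_tokens).long().cumprod(-1).sum().item()
    bonus = target_logits[:, n - 1 + agree, :].argmax(-1, keepdim=True)
    return torch.cat([input_ids, draft_tokens[:, :agree], bonus], dim=-1)

# Usage sketch: new_ids = draft_then_verify_step(big_lm, small_lm, tokenized_prompt)
```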
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration (2024-04-18) [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
- Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics (2024-03-03) [50.982315553104975]
We investigate the bottom-up evolution of lexical semantics for a popular large language model, namely Llama2.
Our experiments show that the representations in lower layers encode lexical semantics, while the higher layers, with weaker semantic induction, are responsible for prediction.
This is in contrast to models with discriminative objectives, such as masked language modeling, where the higher layers obtain better lexical semantics.
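A minimal probe in the spirit of this analysis is to compare each layer's hidden state of the final token with its input embedding; the checkpoint name and the cosine-similarity measure below are one illustrative setup, not the paper's protocol.

```python
# Layer-wise probe sketch: how far does each layer's representation drift from
# the input embedding? (Checkpoint is just an example; any causal LM works.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"      # example; the paper studies Llama2
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

ids = tok("The bank of the river", return_tensors="pt").input_ids
with torch.no_grad():
    states = model(ids).hidden_states  # tuple: embedding output + one tensor per layer

emb = states[0][0, -1]                 # input embedding of the last token
for layer, h in enumerate(states):
    sim = torch.cosine_similarity(h[0, -1], emb, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity to input embedding = {sim:.3f}")
```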
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens (2024-02-24) [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
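A lightweight draft model of this flavor can be imagined as a small head over the embeddings of the most recently generated tokens; the window size, MLP, and wiring below are invented for illustration and are not Chimera's actual architecture.

```python
# Hypothetical tiny draft head that reuses the last few generated tokens to
# propose one draft token (sizes and structure are illustrative guesses).
import torch
import torch.nn as nn


class TinyDraftHead(nn.Module):
    def __init__(self, d_model=64, vocab=1000, window=3):
        super().__init__()
        self.window = window
        self.embed = nn.Embedding(vocab, d_model)
        self.mlp = nn.Sequential(
            nn.Linear(window * d_model, d_model), nn.GELU(), nn.Linear(d_model, vocab)
        )

    def forward(self, token_ids):                          # (batch, seq) tokens generated so far
        recent = self.embed(token_ids[:, -self.window:])   # previously generated tokens
        return self.mlp(recent.flatten(1))                 # logits for the next draft token


draft = TinyDraftHead()
print(draft(torch.randint(0, 1000, (1, 10))).shape)        # torch.Size([1, 1000])
```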
- SPEED: Speculative Pipelined Execution for Efficient Decoding (2023-10-18) [35.45955948053644]
We propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token.
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized.
We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and show how speculation allows training deeper decoders with parameter sharing at minimal runtime overhead.
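Only the parameter-sharing ingredient is easy to show compactly: one decoder layer reused at every depth, so its weights are fetched once and can serve the current token together with a batch of speculative ones; the pipelined scheduling itself is omitted and the sizes below are arbitrary.

```python
# Parameter sharing only (the speculative pipelining is not modeled here):
# a single layer's weights are reused at every depth, so one weight fetch can
# serve the current token plus several speculative tokens processed together.
import torch
import torch.nn as nn

shared_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
depth = 12


def run(h):
    for _ in range(depth):      # same weights applied at every depth
        h = shared_layer(h)
    return h


batch = torch.randn(1, 5, 64)   # one current token plus four speculative tokens
print(run(batch).shape)         # torch.Size([1, 5, 64])
```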
- Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models (2023-09-06) [54.72004797421481]
We conduct the first systematic study to explore a decoding strategy specialized in code generation.
Inspired by the above findings, we propose a simple yet effective method: Adaptive Temperature (AdapT) sampling.
Results show that AdapT sampling significantly outperforms state-of-the-art decoding strategies.
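The mechanism, though not the paper's exact schedule, can be sketched as a per-step temperature that grows with the model's own uncertainty; the entropy-based rule below is an illustrative stand-in for AdapT.

```python
# Illustrative per-step adaptive temperature (the entropy-based schedule is a
# stand-in, not the AdapT rule defined in the paper).
import torch


def adaptive_temperature_sample(logits, t_low=0.2, t_high=1.0):
    """Choose a temperature from the step's uncertainty, then sample one token."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))
    t = t_low + (t_high - t_low) * (entropy / max_entropy)   # uncertain step -> hotter sampling
    return torch.multinomial(torch.softmax(logits / t, dim=-1), num_samples=1)


logits = torch.randn(50257)              # one decoding step over a GPT-2-sized vocabulary
print(adaptive_temperature_sample(logits))
```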
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT (2021-02-15) [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position dependent summarizer (PDS).
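The overall non-autoregressive shape can be approximated with learned positional queries standing in for the position dependent summarizer, so that every output token is emitted in a single pass; the sizes and the query-based stand-in below are assumptions, not LASO's published design.

```python
# Rough "spell once" sketch: positional queries (a stand-in for the PDS) attend
# over the acoustic encoder output and all tokens are predicted in one pass.
import torch
import torch.nn as nn

d, max_len, vocab = 64, 32, 500
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2
)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d, nhead=4, batch_first=True), num_layers=2
)
queries = nn.Parameter(torch.randn(1, max_len, d))   # one learned query per output position
head = nn.Linear(d, vocab)

speech = torch.randn(1, 200, d)                      # stand-in for encoded acoustic frames
memory = encoder(speech)
tokens = head(decoder(queries, memory))              # every position decoded simultaneously
print(tokens.shape)                                  # torch.Size([1, 32, 500])
```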
This list is automatically generated from the titles and abstracts of the papers on this site.