The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation
- URL: http://arxiv.org/abs/2411.03786v1
- Date: Wed, 06 Nov 2024 09:23:50 GMT
- Title: The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation
- Authors: Lawrence Stewart, Matthew Trager, Sujan Kumar Gonugondla, Stefano Soatto
- Abstract summary: Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model.
We show that combinations of simple strategies can achieve significant inference speedups over different tasks.
- Score: 48.52206677611072
- License:
- Abstract: Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model. In this work, we explore the effectiveness of learning-free, negligible-cost draft strategies, namely $N$-grams obtained from the model weights and the context. While the predicted next token of the base model is rarely the top prediction of these simple strategies, we observe that it is often within their top-$k$ predictions for small $k$. Based on this, we show that combinations of simple strategies can achieve significant inference speedups over different tasks. The overall performance is comparable to more complex methods, yet does not require expensive preprocessing or modification of the base model, and allows for seamless 'plug-and-play' integration into pipelines.
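As an illustration of the context-based variant of this idea, here is a minimal sketch in Python (not the authors' implementation; the helper names, the choice of $n$ and $k$, and the greedy acceptance rule are assumptions made for this example). It builds an $n$-gram table from the tokens already in the context, proposes up to $k$ candidate next tokens at negligible cost, and accepts a candidate only if it matches the base model's own prediction, which in a real pipeline would be computed for all candidates in one batched forward pass.

```python
from collections import defaultdict

def build_context_ngrams(tokens, n=3):
    """Map each (n-1)-token prefix seen in the context to the tokens that followed it."""
    table = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        table[tuple(tokens[i:i + n - 1])].append(tokens[i + n - 1])
    return table

def draft_candidates(tokens, table, n=3, k=4):
    """Propose up to k distinct next-token candidates by looking up the current
    (n-1)-gram suffix, preferring the most recent continuations in the context."""
    candidates, seen = [], set()
    for tok in reversed(table.get(tuple(tokens[-(n - 1):]), [])):
        if tok not in seen:
            seen.add(tok)
            candidates.append(tok)
        if len(candidates) == k:
            break
    return candidates

def speculative_step(tokens, base_model_next_token, n=3, k=4):
    """One speculation round: draft k candidates from the context and check whether
    the base model's next token (a stand-in callable here) is among them. In practice
    all k candidates would be verified in a single batched forward pass."""
    table = build_context_ngrams(tokens, n)
    candidates = draft_candidates(tokens, table, n, k)
    target = base_model_next_token(tokens)
    return target, target in candidates

# Toy usage with a dummy "base model" that repeats the most frequent token in the context.
if __name__ == "__main__":
    ctx = [5, 7, 9, 5, 7, 2, 5, 7]
    dummy_model = lambda toks: max(set(toks), key=toks.count)
    print(speculative_step(ctx, dummy_model))
```

In the paper's setting, drafts come from both the context and the model weights, and several such strategies are combined; the point of the sketch is only that the drafting side costs essentially nothing, so even modest top-$k$ hit rates can translate into wall-clock savings once verification is batched.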
Related papers
- Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy.
By generalizing this model to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
arXiv Detail & Related papers (2024-10-23T11:06:36Z)
- How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective [17.956310574300765]
This paper introduces a novel generalized self-imitation learning ($\textbf{GSIL}$) framework.
It effectively and efficiently aligns large language models with offline demonstration data.
$\textbf{GSIL}$ consistently and significantly outperforms baselines in many challenging benchmarks.
arXiv Detail & Related papers (2024-10-14T02:21:29Z)
- u-$\mu$P: The Unit-Scaled Maximal Update Parametrization [4.275373946090221]
We present a new scheme, u-$\mu$P, which improves upon $\mu$P by combining it with Unit Scaling.
The two techniques have a natural affinity: $\mu$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one.
arXiv Detail & Related papers (2024-07-24T17:58:42Z)
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
- Efficient Training of Language Models with Compact and Consistent Next Token Distributions [23.312920633391837]
We show that we can train better models faster by pre-aggregating the corpus with a collapsed $n$-gram distribution.
Our approximation facilitates scalability of gains to larger datasets and models.
arXiv Detail & Related papers (2024-07-03T05:40:41Z)
- PartIR: Composing SPMD Partitioning Strategies for Machine Learning [1.145010277058103]
We present PartIR, our design for a NN partitioning system.
PartIR is focused on an incremental approach to rewriting and is hardware-and-runtime agnostic.
We evaluate PartIR on several different models to demonstrate its predictability, expressibility, and ability to reach peak performance.
arXiv Detail & Related papers (2024-01-20T10:30:31Z)
- PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation [97.78045712375047]
We present a new efficient model architecture for large language models (LLMs).
We show that PanGu-$\pi$-7B achieves performance comparable to benchmark models with about a 10% inference speed-up.
In addition, we have deployed PanGu-$pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application.
arXiv Detail & Related papers (2023-12-27T11:49:24Z)
- FABind: Fast and Accurate Protein-Ligand Binding [127.7790493202716]
$\mathbf{FABind}$ is an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding.
Our proposed model demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods.
arXiv Detail & Related papers (2023-10-10T16:39:47Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- LAVA NAT: A Non-Autoregressive Translation Model with Look-Around Decoding and Vocabulary Attention [54.18121922040521]
Non-autoregressive translation (NAT) models generate multiple tokens in one forward pass.
These NAT models often suffer from the multimodality problem, generating duplicated tokens or missing tokens.
We propose two novel methods to address this issue, the Look-Around (LA) strategy and the Vocabulary Attention (VA) mechanism.
arXiv Detail & Related papers (2020-02-08T04:11:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.