TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding
- URL: http://arxiv.org/abs/2502.15197v1
- Date: Fri, 21 Feb 2025 04:19:24 GMT
- Title: TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding
- Authors: Zhaoxuan Wu, Zijian Zhou, Arun Verma, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low
- Abstract summary: TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens.
- Score: 76.23719557942917
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We propose TETRIS, a novel method that optimizes the total throughput of batch speculative decoding in multi-request settings. Unlike existing methods that optimize for a single request or a group of requests as a whole, TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel, resulting in fewer rejected tokens and hence less wasted computing resources. Such effective resource utilization to achieve fast inference in large language models (LLMs) is especially important to service providers with limited inference capacity. Compared to baseline speculative decoding, TETRIS yields a consistently higher acceptance rate and more effective utilization of the limited inference capacity. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens, leading to more efficient batch inference in LLMs.
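As a picture of the selection step (not the authors' algorithm or released code), the sketch below greedily fills a shared verification budget with the draft-token prefixes most likely to be accepted. The `draft_probs` layout, the `budget` parameter, and the greedy rule are all illustrative assumptions.

```python
# Hypothetical sketch of batch draft-token selection under a shared
# verification budget. Not the authors' implementation; the greedy
# rule and data layout are illustrative assumptions.
from typing import List

def select_draft_tokens(draft_probs: List[List[float]], budget: int) -> List[int]:
    """draft_probs[i][j]: draft model's probability of the j-th speculated
    token for request i. Returns, for each request, how many leading draft
    tokens to send for parallel verification. A token is only selectable if
    all earlier tokens in its draft are selected (verification is left to
    right)."""
    counts = [0] * len(draft_probs)      # tokens selected per request
    expected = [1.0] * len(draft_probs)  # prob. the selected prefix is fully accepted
    for _ in range(budget):
        # Pick the request whose next draft token has the highest expected
        # value: the probability its whole prefix (including it) is accepted.
        best, best_val = -1, 0.0
        for i, probs in enumerate(draft_probs):
            j = counts[i]
            if j < len(probs):
                val = expected[i] * probs[j]
                if val > best_val:
                    best, best_val = i, val
        if best < 0:
            break  # all drafts exhausted before the budget ran out
        expected[best] *= draft_probs[best][counts[best]]
        counts[best] += 1
    return counts

# Example: 3 requests with different draft confidences, budget of 5 slots.
print(select_draft_tokens([[0.9, 0.8, 0.2], [0.6, 0.5], [0.95, 0.9, 0.85]], 5))
```

Because prefix acceptance probabilities only shrink as a draft extends, the greedy rule never selects a token whose earlier tokens were skipped.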
Related papers
- Collaborative Speculative Inference for Efficient LLM Inference Serving [25.133593066927794]
CoSine is a novel speculative inference system that decouples sequential speculative decoding from parallel verification.
With equivalent resource costs, CoSine achieves up to a 23.2% decrease in latency and a 32.5% increase in throughput compared to baseline methods.
arXiv Detail & Related papers (2025-03-13T13:03:38Z)
- Towards Optimal Multi-draft Speculative Decoding [102.67837141152232]
Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts.
This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate.
arXiv Detail & Related papers (2025-02-26T03:22:44Z)
- BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching [4.610983384440473]
We propose BATON, an efficient batch-wise LLM inference scheme that dynamically adjusts the processing batch.
Compared to the state-of-the-art solution Orca, BATON improves query processing by up to 1.75 times.
arXiv Detail & Related papers (2024-10-24T12:53:39Z)
- QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tunes a small pretrained language model to generate optimal prompts tailored to the input queries.
We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks.
Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding (a simplified sketch of the prefix-sharing structure follows this entry).
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
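To make the hypothesis-merging idea concrete, here is a minimal sketch assuming a simple prefix tree: hypotheses that share a prefix reuse the same nodes, so each shared token is verified once. The paper's DAG generalizes this by also merging recurring nodes across branches, which a tree cannot express.

```python
# Hypothetical sketch: merge drafted hypotheses on shared prefixes so
# each shared token is verified only once. A simplification of the
# paper's DAG, which can also merge recurring nodes across branches.
from typing import Dict, List

class Node:
    def __init__(self, token: int):
        self.token = token
        self.children: Dict[int, "Node"] = {}

def build_token_tree(hypotheses: List[List[int]]) -> Node:
    root = Node(token=-1)  # sentinel root; hypotheses hang below it
    for hyp in hypotheses:
        node = root
        for tok in hyp:
            # Reuse an existing child for a shared prefix, else branch.
            node = node.children.setdefault(tok, Node(tok))
    return root

def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(c) for c in node.children.values())

# Three 4-token hypotheses sharing prefixes: 12 tokens if verified
# independently, but only 7 unique tree nodes need verification.
hyps = [[5, 9, 2, 7], [5, 9, 4, 1], [5, 9, 4, 3]]
print(count_nodes(build_token_tree(hyps)) - 1)  # -> 7
```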
- On Speeding Up Language Model Evaluation [48.51924035873411]
Development of prompt-based methods with Large Language Models (LLMs) requires making numerous decisions.
We propose a novel method to address this challenge.
We show that it can identify the top-performing method using only 5-15% of the typically needed resources.
arXiv Detail & Related papers (2024-07-08T17:48:42Z)
- Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning; the standard draft-then-verify loop that such methods build on is sketched after this entry.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)
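For context, the baseline that draft-verification methods improve on is the standard speculative-decoding loop: a draft model proposes tokens cheaply, and the target model verifies them. The toy sketch below implements only the well-known acceptance rule from the original speculative decoding work, not ADED's adaptive mechanism; the distributions are stand-ins.

```python
# Toy sketch of the standard speculative-decoding verification rule:
# accept draft token x with probability min(1, p(x)/q(x)); on rejection,
# resample from the residual distribution max(0, p - q), renormalized.
# The p/q distributions below are stand-ins, not a real model.
import random

def verify_draft(p, q, draft_token):
    """p, q: target/draft next-token distributions as dicts token -> prob."""
    if random.random() < min(1.0, p.get(draft_token, 0.0) / q[draft_token]):
        return draft_token  # draft token accepted as-is
    # Rejected: sample from the residual so the output still follows p.
    residual = {t: max(0.0, p.get(t, 0.0) - q.get(t, 0.0)) for t in p}
    toks, weights = zip(*residual.items())
    return random.choices(toks, weights=weights)[0]

random.seed(0)
p = {"a": 0.6, "b": 0.3, "c": 0.1}  # target model's next-token distribution
q = {"a": 0.3, "b": 0.5, "c": 0.2}  # draft model's next-token distribution
print(verify_draft(p, q, "b"))      # accepted with prob min(1, 0.3/0.5) = 0.6
```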
- A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which are encoded by embedding vectors, resulting in excessively large embedding tables.
Despite the prosperity of lightweight embedding-based RSs, a wide diversity is seen in evaluation protocols.
This study investigates various LERS' performance, efficiency, and cross-task transferability via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z)
- Efficient Prompt Optimization Through the Lens of Best Arm Identification [50.56113809171805]
This work provides a principled framework, TRIPLE, to efficiently perform prompt selection under an explicit budget constraint.
It is built on a novel connection established between prompt optimization and fixed-budget best arm identification (BAI-FB) in multi-armed bandits (MAB); a generic BAI-FB routine is sketched after this entry.
arXiv Detail & Related papers (2024-02-15T05:31:13Z)
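As a generic reference for BAI-FB (not necessarily TRIPLE's exact algorithm), the sketch below runs sequential halving, a standard fixed-budget routine: the evaluation budget is split across elimination rounds, and the weaker half of the prompt pool is dropped each round. The `evaluate` callback and the toy success rates are stand-ins.

```python
# Generic sequential-halving sketch for fixed-budget best arm
# identification, framed here as prompt selection. `evaluate` is a
# stand-in for scoring one prompt on one sampled task instance;
# TRIPLE's actual algorithms may differ.
import math, random

def sequential_halving(prompts, budget, evaluate):
    survivors = list(prompts)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    per_round = budget // rounds
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        pulls = max(1, per_round // len(survivors))  # equal split this round
        scores = {p: sum(evaluate(p) for _ in range(pulls)) / pulls
                  for p in survivors}
        survivors.sort(key=lambda p: scores[p], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]  # keep top half
    return survivors[0]

# Toy demo: each prompt has a hidden success rate; evaluation is a noisy draw.
random.seed(1)
true_rate = {"p1": 0.2, "p2": 0.5, "p3": 0.8, "p4": 0.4}
best = sequential_halving(list(true_rate), budget=400,
                          evaluate=lambda p: float(random.random() < true_rate[p]))
print(best)  # likely "p3"
```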
- JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning [58.71541261221863]
Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost.
We present JoinGym, a query optimization environment for reinforcement learning (RL) that supports both left-deep and bushy join plans.
Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset; a miniature version of this cost model is sketched after this entry.
arXiv Detail & Related papers (2023-07-21T17:00:06Z)
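The cost model just described is easy to reproduce in miniature. In the sketch below, a left-deep join order is scored by summing intermediate-result cardinalities from a precomputed lookup table; the relations and cardinalities are made up for illustration, and this is not JoinGym's API.

```python
# Miniature version of the cardinality-lookup cost model described
# above: the cost of a (left-deep) join order is the sum of its
# intermediate result sizes, read from a precomputed table. The
# relations and cardinalities below are made-up illustrations.
from typing import Dict, FrozenSet, List

def plan_cost(order: List[str], card: Dict[FrozenSet[str], int]) -> int:
    joined = {order[0]}
    cost = 0
    for rel in order[1:]:
        joined.add(rel)
        cost += card[frozenset(joined)]  # size of this intermediate result
    return cost

# Precomputed cardinalities for every subset a plan can produce.
card = {
    frozenset({"R", "S"}): 1_000,
    frozenset({"R", "T"}): 50_000,
    frozenset({"S", "T"}): 200,
    frozenset({"R", "S", "T"}): 5_000,
}
print(plan_cost(["R", "S", "T"], card))  # 1_000 + 5_000 = 6_000
print(plan_cost(["S", "T", "R"], card))  # 200 + 5_000 = 5_200
```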