Faster Cascades via Speculative Decoding
- URL: http://arxiv.org/abs/2405.19261v1
- Date: Wed, 29 May 2024 16:55:08 GMT
- Title: Faster Cascades via Speculative Decoding
- Authors: Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar
- Abstract summary: We propose new speculative cascading techniques that implement their deferral rule through speculative execution.
We show that the proposed approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
- Score: 66.16909847419198
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for "hard" inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel verification mode. These mechanisms offer different benefits: empirically, cascades are often capable of yielding better quality than even the larger model, while theoretically, speculative decoding offers a guarantee of quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades, and employ a plug-in approximation to the optimal rule. Through experiments with T5 models on benchmark language tasks, we show that the proposed approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
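To make the abstract's mechanism concrete, below is a minimal, self-contained sketch of one speculative-cascade drafting round: a small model drafts a few tokens, the large model scores those positions, and a per-token deferral rule keeps or overrides each draft. The toy models, the simple probability-threshold rule, and names such as `speculative_cascade_step`, `gamma`, and `defer_threshold` are illustrative assumptions, not the paper's API or its optimal deferral rule.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size (illustrative only)


def draft_probs(prefix):
    """Stand-in for the small "draft" model: returns a next-token distribution."""
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()


def target_probs(prefix):
    """Stand-in for the large "target" model: returns a next-token distribution."""
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()


def speculative_cascade_step(prefix, gamma=4, defer_threshold=0.3):
    """One drafting round: the small model proposes `gamma` tokens, the large
    model scores those positions, and a per-token deferral rule decides whether
    to keep each draft token or defer to the large model."""
    # 1) Draft `gamma` tokens autoregressively with the small model.
    ctx, draft_tokens = list(prefix), []
    for _ in range(gamma):
        q = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        draft_tokens.append(tok)
        ctx.append(tok)

    # 2) Score every drafted position with the large model (a single parallel
    #    verification call in a real system; a loop here for clarity).
    target_dists = [target_probs(list(prefix) + draft_tokens[:i]) for i in range(gamma)]

    # 3) Deferral rule: keep a draft token while the large model assigns it
    #    enough probability; otherwise emit the large model's token and stop,
    #    since the remaining drafts were conditioned on a rejected prefix.
    output = []
    for tok, p in zip(draft_tokens, target_dists):
        if p[tok] >= defer_threshold:
            output.append(tok)              # cheap path: draft token accepted
        else:
            output.append(int(p.argmax()))  # deferred: large model decides
            break
    return output


print(speculative_cascade_step(prefix=[1, 2, 3]))
```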
Related papers
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z) - Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z) - One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based scenarios.
arXiv Detail & Related papers (2024-05-29T22:12:52Z) - Recurrent Drafter for Fast Speculative Decoding in Large Language Models [18.342742904042673]
We introduce an improved approach of speculative decoding aimed at enhancing the efficiency of serving large language models.
We capitalize on the strengths of two established techniques: the classic two-model speculative decoding approach, and the more recent single-model approach, Medusa.
We empirically demonstrate the effectiveness of the proposed method on several popular open source language models, along with a comprehensive analysis of the trade-offs involved in adopting this approach.
arXiv Detail & Related papers (2024-03-14T23:40:56Z) - Stochastic Two Points Method for Deep Model Zeroth-order Optimization [32.459322001738144]
This paper introduces an efficient Stochastic Two-Point (S2P) approach within the gradient-free regime.
We present the theoretical convergence properties of S2P under general and relaxed smoothness assumptions.
We show that the VS2P variant is highly effective in optimizing objectives for deep models (the classic two-point gradient estimate that such methods build on is recalled after this list).
arXiv Detail & Related papers (2024-02-02T18:39:40Z) - Cascade Speculative Drafting for Even Faster LLM Inference [25.642604897018852]
Speculative decoding improves the efficiency of large language model (LLM) inference.
We introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades.
CS Drafting achieves up to an 81 percent additional speedup over speculative decoding in our experiments.
arXiv Detail & Related papers (2023-12-18T18:59:46Z) - SlimSeg: Slimmable Semantic Segmentation with Boundary Supervision [54.16430358203348]
We propose a simple but effective slimmable semantic segmentation (SlimSeg) method, which can be executed at different capacities during inference.
We show that our proposed SlimSeg with various mainstream networks can produce flexible models that provide dynamic adjustment of computational cost and better performance.
arXiv Detail & Related papers (2022-07-13T14:41:05Z) - Deep Variational Models for Collaborative Filtering-based Recommender Systems [63.995130144110156]
Deep learning provides accurate collaborative filtering models to improve recommender system results.
Our proposed models apply the variational concept to inject stochasticity in the latent space of the deep architecture.
Results show the superiority of the proposed approach in scenarios where the variational enrichment exceeds the injected noise effect.
arXiv Detail & Related papers (2021-07-27T08:59:39Z) - Learning Deep-Latent Hierarchies by Stacking Wasserstein Autoencoders [22.54887526392739]
We propose a novel approach to training models with deep-latent hierarchies based on Optimal Transport.
We show that our method enables the generative model to fully leverage its deep-latent hierarchy, avoiding the well known "latent variable collapse" issue of VAEs.
arXiv Detail & Related papers (2020-10-07T15:04:20Z) - Proximal Mapping for Deep Regularization [15.48377586806766]
Underpinning the success of deep learning are effective regularizations that allow a variety of priors in the data to be modeled.
We propose inserting the proximal mapping as a new layer in the deep network, which directly and explicitly produces well-regularized hidden-layer outputs (the standard proximal operator is recalled after this list).
The resulting technique is shown to be closely connected to kernel warping and dropout, and novel algorithms are developed for robust temporal learning and multiview modeling.
arXiv Detail & Related papers (2020-06-14T07:04:14Z)
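For the "Stochastic Two Points Method" entry above, the classic two-point zeroth-order gradient estimate that such gradient-free methods build on is the standard formula below; the paper's S2P/VS2P estimators and step-size rules may differ.
$$\hat g_\mu(x) \;=\; \frac{f(x + \mu u) - f(x - \mu u)}{2\mu}\, u, \qquad u \sim \mathcal{N}(0, I_d),\ \mu > 0,$$
which requires only two function evaluations of $f$ per step and no gradient access.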
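For the "Proximal Mapping for Deep Regularization" entry, the standard proximal operator that such a layer would evaluate on its input $v$ is given below; the paper's layer-specific regularizers $f$ and solvers are not reproduced here.
$$\operatorname{prox}_{\lambda f}(v) \;=\; \arg\min_{x}\Big( f(x) + \tfrac{1}{2\lambda}\,\lVert x - v\rVert_2^2 \Big), \qquad \lambda > 0,$$
so the layer's output is an explicitly regularized version of its input.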
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.