Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO
- URL: http://arxiv.org/abs/2311.04951v2
- Date: Tue, 9 Apr 2024 11:21:14 GMT
- Title: Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO
- Authors: Haim Barad, Ekaterina Aidova, Yury Gorbachev
- Abstract summary: Inference optimizations are critical for improving user experience and reducing infrastructure costs and power consumption.
In this article, we illustrate a form of dynamic execution known as speculative sampling to reduce the overall latency of text generation.
- Score: 0.6144680854063939
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inference optimizations are critical for improving user experience and reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling to reduce the overall latency of text generation and compare it with standard autoregressive sampling. This can be used together with model-based optimizations (e.g. quantization) to provide an optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.
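Since speculative sampling is the central mechanism here, a minimal sketch may help readers who skip the notebook. The toy below implements the standard draft-and-verify accept/reject rule in NumPy; toy_model and speculative_step are illustrative stand-ins under stated assumptions, not the notebook's actual OpenVINO API.

```python
# Toy draft-and-verify speculative sampling. In the real pipeline a small
# OpenVINO draft model proposes k tokens and the large target model
# verifies them, with both models reusing their KV caches; here simple
# closed-form distributions stand in for the two models.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # toy vocabulary size


def toy_model(last_token, temperature):
    """Stand-in for a language model: next-token probabilities."""
    logits = np.cos(last_token + np.arange(VOCAB))
    probs = np.exp(logits / temperature)
    return probs / probs.sum()


def speculative_step(prefix, k=4):
    """One round: draft k tokens cheaply, then accept/reject them."""
    draft = lambda ctx: toy_model(ctx[-1], temperature=2.0)   # cheap model
    target = lambda ctx: toy_model(ctx[-1], temperature=1.0)  # costly model
    ctx = list(prefix)
    proposed, draft_probs = [], []
    for _ in range(k):  # draft pass: sample k tokens autoregressively
        q = draft(ctx)
        token = int(rng.choice(VOCAB, p=q))
        proposed.append(token)
        draft_probs.append(q)
        ctx.append(token)
    out, ctx = [], list(prefix)
    for token, q in zip(proposed, draft_probs):  # verification pass
        p = target(ctx)
        if rng.random() < min(1.0, p[token] / q[token]):
            out.append(token)  # draft token accepted as-is
            ctx.append(token)
        else:  # rejected: resample from the residual (p - q)+
            residual = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out
    # all k drafts accepted: take one free extra token from the target
    out.append(int(rng.choice(VOCAB, p=target(ctx))))
    return out


tokens = [3]
while len(tokens) < 32:
    tokens.extend(speculative_step(tokens))
print(tokens)
```

The latency win comes from the verification side: the target model can score all k draft positions in a single forward pass while reusing its KV cache, so several tokens are committed per expensive model call instead of one.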
Related papers
- Preference Optimization with Multi-Sample Comparisons [53.02717574375549]
We introduce a novel approach that extends post-training to include multi-sample comparisons.
Single-sample comparisons fail to capture critical characteristics such as generative diversity and bias.
We demonstrate that multi-sample comparison is more effective in optimizing collective characteristics than single-sample comparison.
arXiv Detail & Related papers (2024-10-16T00:59:19Z)
- Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing [13.775902519100075]
Compressed sensing (CS) has emerged to overcome the inefficiency of Nyquist sampling.
Deep learning-based reconstruction has been a promising alternative to optimization-based reconstruction.
arXiv Detail & Related papers (2024-09-18T06:51:29Z)
- Implicit Diffusion: Efficient Optimization through Stochastic Sampling [46.049117719591635]
We present a new algorithm to optimize distributions defined implicitly by parameterized diffusions.
We introduce a general framework for first-order optimization of these processes that performs optimization and sampling steps jointly.
We apply it to training energy-based models and finetuning denoising diffusions.
arXiv Detail & Related papers (2024-02-08T08:00:11Z)
- Sample as You Infer: Predictive Coding With Langevin Dynamics [11.515490109360012]
We present a novel algorithm for parameter learning in generic deep generative models.
Our approach modifies the standard predictive coding (PC) algorithm, bringing performance on par with, and exceeding, that obtained from standard variational autoencoder training.
arXiv Detail & Related papers (2023-11-22T19:36:47Z)
- Optimal Budgeted Rejection Sampling for Generative Models [54.050498411883495]
Rejection sampling methods have been proposed to improve the performance of discriminator-based generative models.
We first propose an Optimal Budgeted Rejection Sampling scheme that is provably optimal.
Second, we propose an end-to-end method that incorporates the sampling scheme into the training procedure to further enhance the model's overall performance (a sketch of the plain rejection-sampling baseline follows below).
arXiv Detail & Related papers (2023-11-01T11:52:41Z)
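For context on the entry above: budgeted schemes refine plain discriminator-based rejection sampling. A hedged sketch of that baseline follows, with toy 1-D Gaussians standing in for the data and generator densities; the paper's provably optimal budgeted rule itself is not reproduced, and all names are illustrative.

```python
# Plain discriminator-based rejection sampling on a 1-D toy problem:
# generator ~ N(0.5, 1), data ~ N(0, 1). An ideal discriminator yields
# the density ratio p_data/p_gen, computable in closed form here.
import numpy as np

rng = np.random.default_rng(1)


def density_ratio(x):
    # log p_data(x) - log p_gen(x) for N(0,1) data vs N(0.5,1) generator
    return np.exp(-0.5 * x**2 + 0.5 * (x - 0.5) ** 2)


def rejection_sample(n, m_cap=4.0):
    """Accept x with probability min(1, r(x)/m_cap); a finite m_cap
    trades fidelity to the data distribution for a higher accept rate."""
    out = []
    while len(out) < n:
        x = rng.normal(0.5, 1.0, size=256)  # batch from the "generator"
        keep = rng.random(256) < np.minimum(1.0, density_ratio(x) / m_cap)
        out.extend(x[keep].tolist())
    return np.array(out[:n])


samples = rejection_sample(10_000)
print(f"mean after rejection: {samples.mean():.3f}")  # pulled from 0.5 toward 0
```

Raising m_cap moves the accepted samples closer to the data distribution at the cost of more rejections, which is, presumably, the trade-off a fixed rejection budget forces the optimal scheme to manage.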
- FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs [92.47146416628965]
FuzzyFlow is a fault localization and test case extraction framework designed to test program optimizations.
We leverage dataflow program representations to capture a fully reproducible system state and area-of-effect for optimizations.
To reduce testing time, we design an algorithm for minimizing test inputs, trading off memory for recomputation.
arXiv Detail & Related papers (2023-06-28T13:00:17Z)
- Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), each have drawbacks: GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
- Fast Variational AutoEncoder with Inverted Multi-Index for Collaborative Filtering [59.349057602266]
The Variational AutoEncoder (VAE) has been extended as a representative nonlinear method for collaborative filtering.
We propose decomposing the inner-product-based softmax probability using an inverted multi-index.
FastVAE can outperform the state-of-the-art baselines in terms of both sampling quality and efficiency.
arXiv Detail & Related papers (2021-09-13T08:31:59Z)
- A Constant-time Adaptive Negative Sampling [33.585006286223994]
We show a class of distributions where the sampling scheme is truly adaptive and provably generates negative samples in constant time.
Our C++ implementation on commodity CPUs is significantly faster in terms of wall-clock time.
arXiv Detail & Related papers (2020-12-31T18:56:41Z)
- BOSH: Bayesian Optimization by Sampling Hierarchically [10.10241176664951]
We propose a novel BO routine pairing a hierarchical Gaussian process with an information-theoretic framework to generate a growing pool of realizations.
We demonstrate that BOSH provides more efficient and higher-precision optimization than standard BO across synthetic benchmarks, simulation optimization, reinforcement learning, and hyperparameter tuning tasks.
arXiv Detail & Related papers (2020-07-02T07:35:49Z)
- Robust Sampling in Deep Learning [62.997667081978825]
Deep learning requires regularization mechanisms to reduce overfitting and improve generalization.
We address this problem with a new regularization method based on distributionally robust optimization.
During training, samples are selected according to their accuracy, so that the worst-performing samples contribute the most to the optimization (see the sketch below).
arXiv Detail & Related papers (2020-06-04T09:46:52Z)
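A rough sketch of that worst-samples-first selection, rendered as a generic hard-example weighting on a toy logistic regression (an illustrative assumption, not the paper's exact distributionally robust rule):

```python
# Toy logistic regression where each step reweights per-example losses
# so the worst-performing samples dominate the gradient (a soft version
# of "the worst samples contribute the most"). Illustrative only.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(512, 5))
y = (X @ rng.normal(size=5) + 0.5 * rng.normal(size=512) > 0).astype(float)

w = np.zeros(5)
for step in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))                  # predictions
    losses = -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    weights = np.exp(losses)                            # emphasize high loss
    weights /= weights.sum()                            # (weights treated as
    grad = X.T @ (weights * (p - y))                    #  constants per step)
    w -= 0.5 * grad
print("mean train loss:", losses.mean())
```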