Optimized Speculative Sampling for GPU Hardware Accelerators
- URL: http://arxiv.org/abs/2406.11016v2
- Date: Thu, 03 Oct 2024 08:05:14 GMT
- Title: Optimized Speculative Sampling for GPU Hardware Accelerators
- Authors: Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet
- Abstract summary: We optimize speculative sampling for parallel hardware accelerators to improve sampling speed.
We distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks.
We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate our methods.
- Score: 14.681982904792763
- Abstract: In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a minor decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
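A minimal NumPy sketch of the acceptance step follows, assuming the standard speculative sampling accept/resample rule; the `sigmoid_probs` function, the greedy draft token, and the toy vocabulary are illustrative assumptions, not the authors' exact implementation. The key property is visible directly in the code: softmax needs a global max/sum reduction over the vocabulary, while the sigmoid variant is purely element-wise, so every entry can be computed by an independent GPU thread.

```python
import numpy as np

def softmax(logits):
    # Exact normalization: requires a global max and sum over the vocabulary,
    # which serializes work across threads.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid_probs(logits):
    # Hypothetical stand-in for the paper's approximation: element-wise only,
    # so each vocabulary entry can be handled by an independent thread.
    return 1.0 / (1.0 + np.exp(-logits))

def accept_or_resample(draft_token, p_target, q_draft, rng):
    # Standard speculative sampling: accept with probability min(1, p/q),
    # otherwise resample from the normalized residual max(0, p - q).
    p, q = p_target[draft_token], q_draft[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    residual = np.maximum(p_target - q_draft, 0.0)
    s = residual.sum()
    if s == 0.0:  # degenerate case under the unnormalized approximation
        residual = p_target / p_target.sum()
    else:
        residual = residual / s
    return int(rng.choice(len(residual), p=residual)), False

# Toy comparison of the exact and approximate variants on one draft token.
rng = np.random.default_rng(0)
vocab = 8
target_logits = rng.normal(size=vocab)
draft_logits = target_logits + 0.1 * rng.normal(size=vocab)

for probs_fn in (softmax, sigmoid_probs):
    p, q = probs_fn(target_logits), probs_fn(draft_logits)
    draft_token = int(np.argmax(q))  # greedy draft proposal for illustration
    token, accepted = accept_or_resample(draft_token, p, q, rng)
    print(probs_fn.__name__, "->", token, "accepted" if accepted else "resampled")
```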
Related papers
- 3DGS-Calib: 3D Gaussian Splatting for Multimodal SpatioTemporal Calibration [9.825752747213297]
We introduce 3DGS-Calib, a new calibration method that relies on the speed and rendering accuracy of 3D Gaussian Splatting representations.
We demonstrate the superiority of our proposal with experimental results on sequences from a widely used driving dataset.
arXiv Detail & Related papers (2024-03-18T08:53:03Z) - On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z) - Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge [17.277918711842457]
Mixed-precision quantization offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy.
This paper proposes a hybrid search methodology to navigate the search space of mixed-precision configurations for a given network.
It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware optimization to find mixed-precision configurations latency-optimized for a specific hardware target.
arXiv Detail & Related papers (2023-07-06T09:57:48Z) - Design and Prototyping Distributed CNN Inference Acceleration in Edge Computing [85.74517957717363]
HALP accelerates inference through seamless collaboration among edge devices (EDs) in Edge Computing.
Experiments show that the distributed inference HALP achieves 1.7x inference acceleration for VGG-16.
It is shown that the model selection with distributed inference HALP can significantly improve service reliability.
arXiv Detail & Related papers (2022-11-24T19:48:30Z) - Fast Variational AutoEncoder with Inverted Multi-Index for Collaborative Filtering [59.349057602266]
Variational AutoEncoder (VAE) has been extended as a representative nonlinear method for collaborative filtering.
We propose to decompose the inner-product-based softmax probability using the inverted multi-index.
FastVAE can outperform the state-of-the-art baselines in terms of both sampling quality and efficiency.
arXiv Detail & Related papers (2021-09-13T08:31:59Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how the resulting summaries help steer this specific process to cut costs and reduce the production of defective parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - Sample and Computation Redistribution for Efficient Face Detection [137.19388513633484]
Training data sampling and computation distribution strategies are the keys to efficient and accurate face detection.
SCRFD-34GF outperforms the best competitor, TinaFace, by 3.86% (AP on the hard set) while being more than 3x faster on GPUs with VGA-resolution images.
arXiv Detail & Related papers (2021-05-10T23:51:14Z) - Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "data-echoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
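Data echoing itself is simple to sketch: when the input pipeline is slower than the optimizer, each fetched batch is reused for several consecutive updates instead of waiting for fresh data. A minimal illustration follows, assuming a hypothetical echo factor and a toy least-squares SGD step rather than the paper's setup.

```python
import numpy as np

def data_echoing(batches, echo_factor, update):
    # Reuse each batch from a slow pipeline for `echo_factor` consecutive
    # optimization steps instead of stalling on I/O or augmentation.
    for batch in batches:
        for _ in range(echo_factor):
            update(batch)

# Toy usage: SGD on a least-squares objective with echoed minibatches.
rng = np.random.default_rng(1)
w = np.zeros(3)

def sgd_step(batch, lr=0.1):
    global w
    x, y = batch
    grad = 2.0 * x.T @ (x @ w - y) / len(y)
    w -= lr * grad

batches = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(10)]
data_echoing(batches, echo_factor=3, update=sgd_step)
print(w)
```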
arXiv Detail & Related papers (2020-10-26T14:55:31Z) - FastForest: Increasing Random Forest Processing Speed While Maintaining Accuracy [2.6118176084782836]
Our proposed FastForest algorithm delivers an average 24% increase in processing speed compared with Random Forest.
It maintains (and frequently exceeds) Random Forest's classification accuracy over tests involving 45 datasets.
Detailed testing of Subbagging sizes found an optimal scalar that delivers a positive mix of processing performance and accuracy.
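Subbagging (training each tree on a sub-sample smaller than the full training set) is straightforward to reproduce; the sketch below uses scikit-learn's `max_samples` argument as a stand-in for FastForest's mechanism, and the 0.5 fraction is an illustrative guess, not the optimal scalar the paper reports.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Subbagging: each tree sees a bootstrap sub-sample of max_samples * n rows,
# cutting per-tree training cost relative to full-size bagging.
subbagged = RandomForestClassifier(n_estimators=100, max_samples=0.5,
                                   bootstrap=True, random_state=0)
print(cross_val_score(subbagged, X, y, cv=5).mean())
```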
arXiv Detail & Related papers (2020-04-06T06:37:03Z) - Scalable Hyperparameter Optimization with Lazy Gaussian Processes [1.3999481573773074]
We present a novel, highly accurate approximation of the underlying Gaussian Process.
Initial experiments show speedups by a factor of 162 on a single node and a further speedup by a factor of 5 in a parallel environment.
arXiv Detail & Related papers (2020-01-16T10:15:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.