PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine
- URL: http://arxiv.org/abs/2202.12674v1
- Date: Fri, 25 Feb 2022 13:24:23 GMT
- Title: PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine
- Authors: Alexander Van Craen and Marcel Breyer and Dirk Pflüger
- Abstract summary: Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning algorithms must be able to efficiently cope with massive
data sets. Therefore, they have to scale well on any modern system and be able
to exploit the computing power of accelerators independent of their vendor. In
the field of supervised learning, Support Vector Machines (SVMs) are widely
used. However, even modern and optimized implementations such as LIBSVM or
ThunderSVM do not scale well for large non-trivial dense data sets on
cutting-edge hardware: Most SVM implementations are based on Sequential Minimal
Optimization, an optimized though inherent sequential algorithm. Hence, they
are not well-suited for highly parallel GPUs. Furthermore, we are not aware of
a performance portable implementation that supports CPUs and GPUs from
different vendors.
We have developed the PLSSVM library to solve both issues. First, we resort
to the formulation of the SVM as a least squares problem. Training an SVM then
boils down to solving a system of linear equations for which highly parallel
algorithms are known. Second, we provide a hardware independent yet efficient
implementation: PLSSVM uses different interchangeable backends--OpenMP, CUDA,
OpenCL, SYCL--supporting modern hardware from various vendors like NVIDIA, AMD,
or Intel on multiple GPUs. PLSSVM can be used as a drop-in replacement for
LIBSVM. We observe speedups of up to 10x on CPUs compared to LIBSVM and of up to
14x on GPUs compared to ThunderSVM. Our implementation scales on many-core CPUs
with a parallel speedup of 74.7 on up to 256 CPU threads and on multiple GPUs
with a parallel speedup of 3.71 on four GPUs.
The code, utility scripts, and the documentation are available on GitHub:
https://github.com/SC-SGS/PLSSVM.
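The core idea from the abstract, that training a least squares SVM reduces to solving a system of linear equations amenable to highly parallel solvers such as the conjugate gradient method, can be illustrated with a minimal NumPy sketch. This is not PLSSVM's actual implementation (which targets GPUs via CUDA, OpenCL, and SYCL); the function names and the RBF kernel choice here are illustrative assumptions. The bias term is eliminated from the LS-SVM KKT system so that CG only ever sees a symmetric positive definite matrix:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # RBF (Gaussian) kernel matrix between the rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def conjugate_gradient(H, rhs, tol=1e-10, max_iter=500):
    # Plain CG for the symmetric positive definite system H x = rhs.
    x = np.zeros_like(rhs)
    r = rhs - H @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hp = H @ p
        step = rs / (p @ Hp)
        x += step * p
        r -= step * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def lssvm_train(X, y, gamma=1.0, cost=1.0):
    # LS-SVM training: the KKT system
    #   [0  1^T; 1  K + I/C] [b; a] = [0; y]
    # is reduced by eliminating the bias b. With H = K + I/C (SPD),
    # solve H s = 1 and H r = y, then b = (1^T r)/(1^T s), a = r - b s.
    n = X.shape[0]
    H = rbf_kernel(X, X, gamma) + np.eye(n) / cost
    ones = np.ones(n)
    s = conjugate_gradient(H, ones)
    r = conjugate_gradient(H, y)
    b = (ones @ r) / (ones @ s)
    return r - b * s, b

def lssvm_predict(X_train, a, b, X_new, gamma=1.0):
    # Decision value f(x) = sum_i a_i k(x_i, x) + b; class = sign(f).
    return np.sign(rbf_kernel(X_new, X_train, gamma) @ a + b)
```

Because the matrix-vector products `H @ p` dominate the CG iterations, the algorithm maps naturally onto GPUs, which is precisely the property the paper exploits in contrast to the inherently sequential SMO loop used by LIBSVM and ThunderSVM.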
Related papers
- Support Vector Machine Implementation on MPI-CUDA and Tensorflow
Framework [0.0]
Support Vector Machine (SVM) algorithm requires a high computational cost to solve a complex quadratic programming (QP) optimization problem.
Parallel multi-architecture computing, available on both multi-core CPUs and highly scalable GPUs, emerges as a promising solution to enhance algorithm performance.
This paper achieves a comparative study that implements the SVM algorithm on different parallel architecture frameworks.
arXiv Detail & Related papers (2023-11-25T02:52:37Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z) - Optimizing Data Collection in Deep Reinforcement Learning [4.9709347068704455]
GPU vectorization can achieve up to $1024\times$ speedup over commonly used CPU simulators.
We show that simulator kernel fusion speedups with a simple simulator are $11.3\times$ and increase by up to $1024\times$ as simulator complexity increases in terms of memory bandwidth requirements.
arXiv Detail & Related papers (2022-07-15T20:22:31Z) - Recipe for Fast Large-scale SVM Training: Polishing, Parallelism, and
more RAM! [0.0]
Support vector machines (SVMs) are a standard method in the machine learning toolbox.
Non-linear kernel SVMs often deliver highly accurate predictors, however, at the cost of long training times.
In this work, we combine both approaches to design an extremely fast dual SVM solver.
arXiv Detail & Related papers (2022-07-03T11:51:41Z) - Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z) - Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization,
Quantizations, Memory Optimizations, and More [26.748770505062378]
SLIDE is a C++ implementation of a sparse hash table based back-propagation.
We show how SLIDE's computations allow for a unique possibility of vectorization via AVX-512 (Advanced Vector Extensions 512).
Our experiments are focused on large (hundreds of millions of parameters) recommendation and NLP models.
arXiv Detail & Related papers (2021-03-06T02:13:43Z) - GPU-Accelerated Primal Learning for Extremely Fast Large-Scale
Classification [10.66048003460524]
One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON.
We show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced.
arXiv Detail & Related papers (2020-08-08T03:40:27Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.