GPUTreeShap: Massively Parallel Exact Calculation of SHAP Scores for
Tree Ensembles
- URL: http://arxiv.org/abs/2010.13972v3
- Date: Thu, 3 Feb 2022 11:53:13 GMT
- Title: GPUTreeShap: Massively Parallel Exact Calculation of SHAP Scores for
Tree Ensembles
- Authors: Rory Mitchell, Eibe Frank, Geoffrey Holmes
- Abstract summary: We present a reformulated TreeShap algorithm suitable for massively parallel computation on graphics processing units.
We achieve speedups of up to 19x for SHAP values, and speedups of up to 340x for SHAP interaction values, over a state-of-the-art multi-core CPU implementation.
- Score: 0.8057006406834467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: SHAP (SHapley Additive exPlanation) values provide a game theoretic
interpretation of the predictions of machine learning models based on Shapley
values. While exact calculation of SHAP values is computationally intractable
in general, a recursive polynomial-time algorithm called TreeShap is available
for decision tree models. However, despite its polynomial time complexity,
TreeShap can become a significant bottleneck in practical machine learning
pipelines when applied to large decision tree ensembles. Unfortunately, the
complicated TreeShap algorithm is difficult to map to hardware accelerators
such as GPUs. In this work, we present GPUTreeShap, a reformulated TreeShap
algorithm suitable for massively parallel computation on graphics processing
units. Our approach first preprocesses each decision tree to isolate variable
sized sub-problems from the original recursive algorithm, then solves a bin
packing problem, and finally maps sub-problems to single-instruction,
multiple-thread (SIMT) tasks for parallel execution with specialised hardware
instructions. With a single NVIDIA Tesla V100-32 GPU, we achieve speedups of up
to 19x for SHAP values, and speedups of up to 340x for SHAP interaction values,
over a state-of-the-art multi-core CPU implementation executed on two 20-core
Xeon E5-2698 v4 2.2 GHz CPUs. We also experiment with multi-GPU computing using
eight V100 GPUs, demonstrating throughput of 1.2M rows per second -- equivalent
CPU-based performance is estimated to require 6850 CPU cores.
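As a concrete illustration of the middle stage, the Python sketch below packs the variable-sized sub-problems (one root-to-leaf path per leaf) into warp-sized bins with a first-fit-decreasing heuristic. The path lengths and the warp size of 32 threads are illustrative assumptions; GPUTreeShap itself implements this pipeline in CUDA C++, not Python.

```python
# Sketch of the bin packing stage: pack variable-length root-to-leaf
# paths (one sub-problem per leaf) into warp-sized bins so that each
# SIMT warp is kept as full as possible.
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def pack_paths(path_lengths):
    """First-fit-decreasing packing of path lengths into WARP_SIZE bins."""
    bins = []  # each bin: [remaining_capacity, [path indices]]
    # Place the longest sub-problems first to reduce wasted lanes.
    for idx in sorted(range(len(path_lengths)),
                      key=lambda i: path_lengths[i], reverse=True):
        length = path_lengths[idx]
        for b in bins:
            if b[0] >= length:       # first existing bin with enough room
                b[0] -= length
                b[1].append(idx)
                break
        else:                        # no bin fits: open a new one
            bins.append([WARP_SIZE - length, [idx]])
    return bins

# Hypothetical per-leaf path lengths from a small ensemble.
print(pack_paths([12, 7, 30, 5, 9, 20, 3]))
```

Each packed bin then becomes one SIMT task, so the threads of a warp cooperate on sub-problems of similar total size instead of idling.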
Related papers
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters [10.403248386029407]
Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction.
Our algorithm, called Tree Attention, parallelizes exact attention across multiple GPUs and enables cross-device decoding.
We demonstrate that Tree Attention speeds up decoding up to 4x on Llama 3.1-8B and can be applied to a variety of hardware and networking setups.
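The associative combine that makes such a tree reduction possible can be sketched in a few lines. The (running max, running sum of exponentials) partials below are the standard streaming-softmax state used in exact attention; the made-up numbers and plain-Python reduction only illustrate the log-depth schedule, not the paper's implementation.

```python
import math

# Merge two streaming-softmax partials (running max m, running sum s of
# exp(x - m)); the operation is associative, so partials can be combined
# pairwise in a log-depth tree across devices.
def combine(a, b):
    m = max(a[0], b[0])
    s = a[1] * math.exp(a[0] - m) + b[1] * math.exp(b[0] - m)
    return (m, s)

def tree_reduce(partials):
    """Reduce in ceil(log2(n)) rounds instead of n - 1 sequential steps."""
    while len(partials) > 1:
        nxt = [combine(partials[i], partials[i + 1])
               for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:        # an odd element carries over
            nxt.append(partials[-1])
        partials = nxt
    return partials[0]

# Made-up per-device partials (running_max, running_sum).
parts = [(0.3, 4.0), (1.2, 2.5), (-0.5, 6.1), (0.9, 3.3)]
m, s = tree_reduce(parts)
print(m + math.log(s))  # global logsumexp recovered from the partials
```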
arXiv Detail & Related papers (2024-08-07T21:16:55Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
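To make "nth-order gradient" concrete, the sketch below obtains higher-order derivatives by nesting forward-mode automatic differentiation. This plain-Python dual-number class only illustrates the kind of computation graph INR-Arch consumes; it is not the paper's compiler or architecture.

```python
# Forward-mode AD with dual numbers: a + b*eps, where eps**2 == 0.
class Dual:
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def diff(f):
    """One more derivative: seed a dual number and read off the tangent."""
    return lambda x: f(Dual(x, 1.0)).b

def nth_derivative(f, n):
    for _ in range(n):
        f = diff(f)  # nesting duals inside duals gives higher orders
    return f

f = lambda x: x * x * x            # f(x) = x^3
print(nth_derivative(f, 2)(2.0))   # f''(x) = 6x, so 12.0
```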
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Parallel Tree Kernel Computation [0.0]
We devise a parallel implementation of the sequential algorithm for the computation of some tree kernels of two finite sets of trees.
Results show that the proposed parallel algorithm outperforms the sequential one in terms of latency.
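A minimal sketch of the pairwise structure such an algorithm parallelizes: a simplified subset-tree-style kernel evaluated over all pairs from two sets of trees, with a thread pool standing in for the paper's parallel scheme. The tuple encoding of trees and the toy inputs are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Trees encoded as (label, child, child, ...) tuples.
def match(n1, n2):
    """Count matching subtree fragments rooted at n1 and n2."""
    if n1[0] != n2[0] or len(n1) != len(n2):
        return 0
    if len(n1) == 1:                 # both are leaves with equal labels
        return 1
    prod = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        prod *= 1 + match(c1, c2)
    return prod

def nodes(t):
    yield t
    for c in t[1:]:
        yield from nodes(c)

def kernel(t1, t2):
    """Sum matches over all node pairs of the two trees."""
    return sum(match(a, b) for a in nodes(t1) for b in nodes(t2))

set_a = [("S", ("A",), ("B",)), ("S", ("A",), ("A",))]
set_b = [("S", ("A",), ("B",))]
with ThreadPoolExecutor() as pool:   # one task per tree pair
    results = list(pool.map(lambda p: kernel(*p), product(set_a, set_b)))
print(results)
```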
arXiv Detail & Related papers (2023-05-12T18:16:45Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
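The two-step decomposition can be illustrated with a toy example: a tiny block-GEMM "primitive" standing in for a TPP, wrapped in logical loops over tiles that a compiler would be free to reorder or parallelize. The tile size and the numpy implementation are illustrative, not the framework's API.

```python
import numpy as np

TILE = 4  # illustrative tile size

def gemm_primitive(a_blk, b_blk, c_blk):
    """Computational core (stand-in for a TPP): C_blk += A_blk @ B_blk."""
    c_blk += a_blk @ b_blk

def tiled_matmul(a, b):
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n))
    # Logical loops around the primitive; a compiler may reorder,
    # parallelize, or fuse these without touching the core.
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            for p in range(0, k, TILE):
                gemm_primitive(a[i:i + TILE, p:p + TILE],
                               b[p:p + TILE, j:j + TILE],
                               c[i:i + TILE, j:j + TILE])
    return c

a, b = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(tiled_matmul(a, b), a @ b)
```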
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
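For context, a least squares SVM replaces the usual QP with a single linear system. The numpy sketch below shows the regression variant with an RBF kernel; the system layout follows the standard LS-SVM formulation, while the kernel parameter, regularisation, and toy data are assumptions. This dense solve is exactly the work PLSSVM moves onto GPUs.

```python
import numpy as np

def rbf(x1, x2, gamma=1.0):
    """RBF kernel matrix between two sets of points."""
    d = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def lssvm_fit(x, y, reg=10.0):
    """Solve [[0, 1^T], [1, K + I/reg]] @ [b, alpha] = [0, y]."""
    n = len(x)
    a = np.zeros((n + 1, n + 1))
    a[0, 1:] = 1.0
    a[1:, 0] = 1.0
    a[1:, 1:] = rbf(x, x) + np.eye(n) / reg
    sol = np.linalg.solve(a, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]             # bias b, coefficients alpha

x = np.random.rand(50, 2)
y = np.sin(3 * x[:, 0]) + x[:, 1]
b, alpha = lssvm_fit(x, y)
pred = rbf(x, x) @ alpha + b           # in-sample predictions
print(float(np.abs(pred - y).max()))
```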
arXiv Detail & Related papers (2022-02-25T13:24:23Z) - Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
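As a reference point for what such a layer computes, the naive numpy version below implements a 1D dilated convolution with valid padding and unit stride (both assumptions); an optimized AVX-512 implementation effectively vectorizes this loop.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """y[i] = sum_k w[k] * x[i + k * dilation] (valid padding, stride 1)."""
    span = (len(w) - 1) * dilation + 1    # receptive field of the filter
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[k] * x[i + k * dilation] for k in range(len(w)))
    return out

x = np.arange(10, dtype=float)
print(dilated_conv1d(x, np.array([1.0, -1.0]), dilation=2))
# With dilation=2 each output is x[i] - x[i + 2], i.e. all -2 here.
```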
arXiv Detail & Related papers (2021-04-16T09:54:30Z) - Systolic Computing on GPUs for Productive Performance [2.8064596842326575]
We propose a language and compiler to productively build high-performance systolic arrays that run on GPUs.
A programmer specifies a projection of a dataflow computation onto a linear systolic array, while leaving the detailed implementation of the projection to a compiler.
The compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs.
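A simplified software model of a linear systolic array is sketched below: partial sums shift one processing element (PE) per cycle while each new input is broadcast to all PEs, yielding a sliding dot product. This broadcast variant and the toy data are illustrative only; the paper's compiler emits GPU code rather than a Python simulation.

```python
def systolic_dot_products(x, w):
    """Linear array of len(w) PEs: partial sums shift one PE per cycle
    while each new input is broadcast, so the value leaving the last PE
    on cycle t is sum_k w[k] * x[t - len(w) + 1 + k]."""
    n_pe = len(w)
    sums = [0.0] * n_pe
    out = []
    for t, x_in in enumerate(x):
        sums = [0.0] + sums[:-1]                    # shift partial sums right
        sums = [s + w[i] * x_in for i, s in enumerate(sums)]  # all PEs fire
        if t >= n_pe - 1:                           # pipeline is full
            out.append(sums[-1])
    return out

x = [float(v) for v in range(6)]
print(systolic_dot_products(x, [1.0, -1.0]))  # x[t-1] - x[t]: all -1.0
```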
arXiv Detail & Related papers (2020-10-29T18:49:54Z) - At-Scale Sparse Deep Neural Network Inference with Efficient GPU
Implementation [24.824295164938604]
This paper presents GPU performance optimization and scaling results for inference models of the Sparse Deep Neural Network Challenge 2020.
Sparse deep neural networks (SpDNN) have shown promise for reining in the memory footprint of large neural networks.
This work presents optimized sparse matrix multiplication kernels fused with the ReLU function.
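The fused operation is conceptually simple. The scipy sketch below shows one SpDNN layer as a sparse-matrix-times-dense-activations product followed by ReLU; here these are two separate calls where the GPU kernel fuses them, and the sizes, density, and bias are illustrative.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# One layer: sparse weights times dense activations, then ReLU.
rng = np.random.default_rng(0)
w = sparse_random(1024, 1024, density=0.01, format="csr", random_state=0)
x = rng.standard_normal((1024, 64))
bias = -0.3                            # illustrative layer bias

y = np.maximum(w @ x + bias, 0.0)      # SpMM and ReLU, fused in one GPU kernel
print(y.shape, float((y > 0).mean()))
```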
arXiv Detail & Related papers (2020-07-28T12:09:43Z) - Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization on several real datasets.
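One way to picture a CPU+GPU split is the proportional batch-sharing sketch below: each mini-batch is divided between two workers according to an assumed 3:1 throughput ratio, and their partial gradients are combined into a single SGD step. The linear-regression objective and all constants are illustrative, not the paper's algorithms.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.01 * rng.standard_normal(1000)

w = np.zeros(5)
lr, batch, gpu_share = 0.05, 64, 0.75   # "GPU" takes 3/4 of each batch
for step in range(200):
    idx = rng.choice(len(X), batch, replace=False)
    split = int(batch * gpu_share)
    grads = []
    for sel in (idx[:split], idx[split:]):       # GPU part, CPU part
        xb, yb = X[sel], y[sel]
        grads.append(2 * xb.T @ (xb @ w - yb))   # unnormalized partial grad
    w -= lr * (grads[0] + grads[1]) / batch      # one combined SGD step
print(np.round(w, 2))                            # close to true_w
```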
arXiv Detail & Related papers (2020-04-19T05:21:20Z) - MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical
Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to a low-performing solver, we derive the new solver MPLP++, which outperforms all existing solvers by a large margin.
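The underlying principle can be sketched generically: block-coordinate ascent maximizes a concave objective exactly over one block of variables at a time. The quadratic toy problem below only shows the update structure; MPLP++'s actual updates operate on the dual of the MAP problem for dense graphical models.

```python
import numpy as np

rng = np.random.default_rng(2)
m = rng.standard_normal((6, 6))
A = m @ m.T + 6 * np.eye(6)            # positive definite: concave objective
b = rng.standard_normal(6)
blocks = [np.arange(0, 3), np.arange(3, 6)]

x = np.zeros(6)
for sweep in range(50):
    for B in blocks:
        rest = np.setdiff1d(np.arange(6), B)
        # Exact maximizer of -0.5 x'Ax + b'x over x_B with x_rest fixed.
        x[B] = np.linalg.solve(A[np.ix_(B, B)],
                               b[B] - A[np.ix_(B, rest)] @ x[rest])
print(np.allclose(x, np.linalg.solve(A, b)))  # reaches the global optimum
```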
arXiv Detail & Related papers (2020-04-16T16:20:53Z) - Efficient Video Semantic Segmentation with Labels Propagation and
Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines, on the CPU, a very fast optical flow method used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
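The CPU-side propagation step can be sketched as backward warping of the previous label map along a dense flow field. Nearest-neighbour rounding and the toy labels/flow below are assumptions, not the paper's exact method.

```python
import numpy as np

def propagate(labels, flow):
    """labels: (H, W) int map; flow: (H, W, 2) backward flow (dy, dx)."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return labels[src_y, src_x]        # each pixel copies its source label

labels = np.zeros((4, 6), dtype=int)
labels[:, :3] = 1                      # left half labelled class 1
flow = np.zeros((4, 6, 2))
flow[..., 1] = -1.0                    # scene moved one pixel to the right
print(propagate(labels, flow))         # the class-1 region shifts right
```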
arXiv Detail & Related papers (2019-12-26T11:45:15Z)