Fast Inference of Tree Ensembles on ARM Devices
- URL: http://arxiv.org/abs/2305.08579v1
- Date: Mon, 15 May 2023 12:05:03 GMT
- Title: Fast Inference of Tree Ensembles on ARM Devices
- Authors: Simon Koschel, Sebastian Buschjäger, Claudio Lucchese, Katharina Morik
- Abstract summary: We convert the popular QuickScorer algorithm and its siblings from Intel's AVX to ARM's NEON instruction set, extend the implementation from ranking models to classification models such as Random Forests, and investigate the effects of fixed-point quantization in Random Forests.
- Score: 6.995377781193234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the ongoing integration of Machine Learning models into everyday life,
e.g. in the form of the Internet of Things (IoT), the efficient evaluation of learned
models is becoming an increasingly important issue. Tree ensembles are among the
best black-box classifiers available and routinely outperform more complex
classifiers. While the fast application of tree ensembles has already been
studied in the literature for Intel CPUs, it has not yet been studied in the
context of ARM CPUs, which dominate IoT applications. In this
paper, we first convert the popular QuickScorer algorithm and its siblings from
Intel's AVX to ARM's NEON instruction set. Second, we extend our implementation
from ranking models to classification models such as Random Forests. Third, we
investigate the effects of using fixed-point quantization in Random Forests.
Our study shows that a careful implementation of tree traversal on ARM CPUs
leads to a speed-up of up to 9.4x compared to a reference implementation.
Moreover, quantized models outperform models using floating-point
values in terms of speed in almost all cases, with negligible impact on the
predictive performance of the model. Finally, our study highlights
architectural differences between ARM and Intel CPUs and between different ARM
devices, implying that the best implementation depends on both the specific
forest and the specific device used for deployment.
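The two concrete techniques named in the abstract, bitvector-based tree traversal ported from AVX to NEON and fixed-point quantization, can be made tangible with a short sketch. The C fragment below is a minimal illustration of a QuickScorer-style traversal step, not the authors' implementation: it assumes at most 32 internal nodes per tree so that a single uint32_t acts as the live-leaf bitvector, and the names NodeBlock, apply_false_nodes, and exit_leaf are ours.

```c
#include <arm_neon.h>
#include <stdint.h>

typedef struct {
    float    thresholds[4];  /* split thresholds of four nodes testing the same feature */
    uint32_t masks[4];       /* per-node bitvectors: 0-bits mark leaves cut off when the test fails */
} NodeBlock;

/* Test one feature value against four node thresholds at once and AND the
 * bitvectors of all "false" nodes (value > threshold) into the tree's
 * live-leaf bitvector. */
static uint32_t apply_false_nodes(uint32_t live, float fval, const NodeBlock *b)
{
    float32x4_t v   = vdupq_n_f32(fval);
    float32x4_t th  = vld1q_f32(b->thresholds);
    uint32x4_t  m   = vld1q_u32(b->masks);
    uint32x4_t  gt  = vcgtq_f32(v, th);            /* all-ones lane where the test fails */
    uint32x4_t  sel = vorrq_u32(vandq_u32(gt, m),  /* node mask where the test fails,    */
                                vmvnq_u32(gt));    /* all-ones where it holds            */
    uint32x2_t  f   = vand_u32(vget_low_u32(sel), vget_high_u32(sel));
    return live & vget_lane_u32(f, 0) & vget_lane_u32(f, 1);
}

/* Once all false nodes of a tree have been applied, the exit leaf is the
 * leftmost set bit of the live bitvector (leaves numbered left to right). */
static int exit_leaf(uint32_t live)
{
    return __builtin_clz(live);  /* live is never 0: at least one leaf always survives */
}
```

Fixed-point quantization composes naturally with this scheme. Assuming thresholds and feature values are pre-scaled to int16_t with a shared factor (an illustrative scheme, not necessarily the paper's exact format), one NEON compare then covers eight nodes instead of four, consistent with the abstract's finding that quantized models are faster in almost all cases:

```c
#include <math.h>  /* lrintf */

/* Quantize a float to int16 fixed point under a shared scale factor. */
static inline int16_t to_fixed(float x, float scale)
{
    return (int16_t)lrintf(x * scale);
}

/* One quantized NEON compare tests eight thresholds per instruction. */
static uint16x8_t compare8_fixed(int16_t fval, const int16_t th[8])
{
    return vcgtq_s16(vdupq_n_s16(fval), vld1q_s16(th));
}
```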
Related papers
- LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones [10.435069781620957]
Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks.
We analyze common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency.
We combine both macro and micro design to create a new family of hardware-efficient backbone networks called LowFormer.
arXiv Detail & Related papers (2024-09-05T12:18:32Z)
- Register Your Forests: Decision Tree Ensemble Optimization by Explicit CPU Register Allocation [3.737361598712633]
We present a code generation approach for decision tree ensembles, which produces machine assembly code within a single conversion step.
The results show that the performance of decision tree ensemble inference can be significantly improved.
arXiv Detail & Related papers (2024-04-10T09:17:22Z)
- Grassroots Operator Search for Model Edge Adaptation [2.1756721838833797]
Hardware-aware Neural Architecture Search (HW-NAS) is increasingly being used to design efficient deep learning architectures.
We present a Grassroots Operator Search (GOS) methodology to search for efficient operator replacement.
Our method consistently outperforms the original models on two edge devices, with a minimum of 2.2x speedup while maintaining high accuracy.
arXiv Detail & Related papers (2023-09-20T12:15:58Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression down to 3-bit precision.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Towards a learning-based performance modeling for accelerating Deep Neural Networks [1.1549572298362785]
We begin an investigation of predictive models based on machine learning techniques to optimize Convolutional Neural Networks (CNNs).
Preliminary experiments on a Midgard-based ARM Mali GPU show that our predictive model outperforms all of the convolution operators manually selected by the library.
arXiv Detail & Related papers (2022-12-09T18:28:07Z)
- Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime [57.5143536744084]
High performance of deep learning models comes at the expense of high computational, storage and power requirements.
We introduce Deeplite Neutrino for production-ready optimization of models and Deeplite Runtime for deployment of ultra-low bit quantized models on Arm-based platforms.
arXiv Detail & Related papers (2022-07-18T15:05:17Z)
- Optimization of Decision Tree Evaluation Using SIMD Instructions [0.0]
We explore MatrixNet, the ancestor of the popular CatBoost library.
This paper investigates the opportunities offered by the AVX instruction set to evaluate models more efficiently.
arXiv Detail & Related papers (2022-05-15T15:12:40Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ to ASR tasks by comparing it with Conformer across multiple ASR benchmarks.
Our analysis shows that SRU++ can surpass Conformer on long-form speech input by a large margin.
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
- ARMS: Antithetic-REINFORCE-Multi-Sample Gradient for Binary Variables [60.799183326613395]
ARMS is an Antithetic-REINFORCE-Multi-Sample gradient estimator for binary variables.
ARMS uses a copula to generate any number of mutually antithetic samples.
We evaluate ARMS on several datasets for training generative models, and our experimental results show that it outperforms competing methods.
arXiv Detail & Related papers (2021-05-28T23:19:54Z)
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models with statistical models, the roofline model, and a refined roofline model.
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best model structure of BERT for a given computation size on specific devices.
Our framework guarantees that the identified model meets both the resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)