Understanding Cache Boundness of ML Operators on ARM Processors
- URL: http://arxiv.org/abs/2102.00932v1
- Date: Mon, 1 Feb 2021 16:05:50 GMT
- Title: Understanding Cache Boundness of ML Operators on ARM Processors
- Authors: Bernhard Klein, Christoph Gratl, Manfred Mücke, and Holger Fröning
- Abstract summary: This is the first in-depth analysis of dense and convolution operators, generated with TVM, that compares them to the fundamental hardware limits of embedded ARM processors.
One can see that single-precision general matrix multiply (GEMM) and convolutions are bound by L1-cache-read bandwidth.
Explorations of 8-bit and bit-serial quantized operators show that quantization can be used to achieve relevant speedups.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Learning compilers like TVM allow fast and flexible deployment
on embedded CPUs. This enables the use of non-standard operators, which are
common in ML compression techniques. However, it is necessary to understand the
limitations of typical compute-intensive operators in ML workloads to design a
proper solution. This is the first in-depth analysis of dense and convolution
operators, generated with TVM, that compares them to the fundamental hardware
limits of embedded ARM processors. In doing so, it explains the gap between
computational peak performance, theoretical and measured, and real-world
state-of-the-art results created with TVM and OpenBLAS. Rather than being
limited by compute, single-precision general matrix multiply (GEMM) and
convolutions turn out to be bound by L1-cache-read bandwidth. Explorations of
8-bit and bit-serial quantized
operators show that quantization can be used to achieve relevant speedups
compared to cache-bound floating-point operators. However, the performance of
quantized operators depends strongly on the interaction between data layout and
bit packing.
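To make the cache-boundness argument concrete, here is a back-of-the-envelope roofline sketch in Python. All hardware figures (clock, SIMD width, FMA throughput, L1 read bytes per cycle) are illustrative assumptions for a small ARM core, not values measured in the paper; the point is only to show how a register-blocked GEMM micro-kernel can be limited by L1-cache-read bandwidth rather than by the compute peak, and why fewer bytes per operand (8-bit or bit-serial) raise that ceiling.

```python
# Roofline-style estimate: is a register-blocked GEMM micro-kernel limited
# by peak FLOP/s or by L1-cache-read bandwidth? All figures below are
# illustrative assumptions for a small ARM core, not measured values.

FREQ_HZ        = 1.8e9   # assumed clock frequency
SIMD_LANES_F32 = 4       # 128-bit NEON: 4 float32 lanes
FMA_PER_CYCLE  = 2       # assumed: two 128-bit FMAs issued per cycle
L1_READ_BPC    = 8       # assumed: 8 bytes readable from L1 per cycle

peak_flops = FREQ_HZ * SIMD_LANES_F32 * 2 * FMA_PER_CYCLE   # 2 flops per FMA lane
l1_read_bw = FREQ_HZ * L1_READ_BPC                          # bytes per second

def attainable_gflops(m_r, n_r, bytes_per_elem=4):
    """Roofline bound for an m_r x n_r register tile.

    Per k-step the kernel reads m_r + n_r operands from L1 and performs
    2 * m_r * n_r flops, so arithmetic intensity is measured against
    bytes read from L1, not against DRAM traffic.
    """
    flops_per_k = 2 * m_r * n_r
    bytes_per_k = (m_r + n_r) * bytes_per_elem
    intensity = flops_per_k / bytes_per_k              # flops per L1 byte
    return min(peak_flops, intensity * l1_read_bw) / 1e9

for tile in [(4, 4), (8, 4), (8, 8)]:
    print(f"fp32 tile {tile}: {attainable_gflops(*tile):5.1f} GFLOP/s "
          f"(compute peak {peak_flops / 1e9:.1f})")

# With 8-bit operands each element costs 1 byte instead of 4, so the same
# tile has 4x the arithmetic intensity with respect to L1 reads -- one
# reason quantized operators can escape the L1-read-bandwidth ceiling
# (the integer compute peak differs and is not modelled here).
int8_intensity = (2 * 8 * 8) / ((8 + 8) * 1)   # flops per byte with 1-byte operands
print(f"int8 tile (8, 8): L1-read roof at {int8_intensity * l1_read_bw / 1e9:5.1f} GOP/s")
```

Under these assumed numbers, the small register tiles that fit a NEON register file sit below the compute roof, which is consistent with the paper's observation that measured GEMM performance tracks the L1-cache-read bound.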
Related papers
- Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs [0.8217552831952]
Large language models (LLMs) have transformed the way we think about language understanding and generation.
Group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process.
We present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions.
arXiv Detail & Related papers (2024-12-23T03:44:29Z)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning.
However, deploying LLM inference poses challenges due to its high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv Detail & Related papers (2024-06-16T09:51:55Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of as little as 3 bits.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
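Both DeepGEMM and, in a different setting, FLUTE above build on the same core trick: with ultra-low-precision operands there are so few distinct values that all possible products can be precomputed once and multiplications replaced by table lookups. The scalar Python sketch below is only a conceptual illustration of that idea; the packing, table layout, and SIMD byte-shuffle lookups of the actual kernels are not modelled.

```python
# Conceptual sketch of a lookup-table (LUT) dot product for ultra-low-
# precision operands. Real kernels pack several operands per byte and do
# the lookups with SIMD byte-shuffle instructions; only the arithmetic
# idea is shown here.

WBITS, ABITS = 2, 2                       # assumed operand widths
WLEVELS, ALEVELS = 1 << WBITS, 1 << ABITS

# Precompute the product of every (weight code, activation code) pair once.
LUT = [[w * a for a in range(ALEVELS)] for w in range(WLEVELS)]

def lut_dot(w_codes, a_codes):
    """Dot product over quantization codes using table lookups, no multiplies."""
    acc = 0
    for w, a in zip(w_codes, a_codes):
        acc += LUT[w][a]                  # lookup replaces the multiplication
    return acc

# Tiny usage example with 2-bit codes (values 0..3).
w_codes = [3, 0, 2, 1]
a_codes = [1, 2, 3, 3]
assert lut_dot(w_codes, a_codes) == sum(w * a for w, a in zip(w_codes, a_codes))
print(lut_dot(w_codes, a_codes))          # -> 12
```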
- Pex: Memory-efficient Microcontroller Deep Learning through Partial Execution [11.336229510791481]
We discuss a novel execution paradigm for microcontroller deep learning.
It modifies the execution of neural networks to avoid materialising full buffers in memory.
This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time.
arXiv Detail & Related papers (2022-11-30T18:47:30Z)
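The partial-execution idea can be pictured as operators that stream a few rows at a time, so no layer's output is ever fully materialised. The sketch below is a hypothetical illustration of that execution paradigm using plain Python generators, not Pex's actual implementation or memory planner.

```python
# Hypothetical illustration of partial execution: each operator consumes
# and produces a slice (here: rows) of its tensors at a time, so peak
# memory is a few rows rather than any full intermediate feature map.

def relu_rows(rows):
    for row in rows:                           # one input row resident at a time
        yield [max(0.0, x) for x in row]

def avgpool_rows(rows, window=2):
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) == window:                 # only a small row window is buffered
            yield [sum(col) / window for col in zip(*buf)]
            buf.clear()

def input_rows(height, width):
    for i in range(height):                    # stand-in for reading input row by row
        yield [float(i * width + j) for j in range(width)]

# Peak buffering is O(window * width), independent of the feature-map height.
for out_row in avgpool_rows(relu_rows(input_rows(height=6, width=4))):
    print(out_row)
```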
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
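A minimal sketch of the vector-wise absmax Int8 quantization underlying such 8-bit matrix multiplies is given below; the mixed-precision outlier decomposition that LLM.int8() adds on top is deliberately omitted, so this illustrates only the basic scheme, not the paper's kernel.

```python
import numpy as np

# Minimal sketch of vector-wise absmax Int8 matrix multiplication:
# quantize each row of X and each column of W to int8 with its own scale,
# multiply in integers, then rescale the int32 accumulators back to float.

def absmax_quantize(t, axis):
    scale = 127.0 / np.maximum(np.abs(t).max(axis=axis, keepdims=True), 1e-8)
    q = np.clip(np.round(t * scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x, w):
    xq, sx = absmax_quantize(x, axis=1)   # one scale per row of x
    wq, sw = absmax_quantize(w, axis=0)   # one scale per column of w
    acc = xq.astype(np.int32) @ wq.astype(np.int32)
    return acc / (sx * sw)                # rescale back to floating point

rng = np.random.default_rng(0)
x, w = rng.standard_normal((4, 64)), rng.standard_normal((64, 8))
err = np.abs(int8_matmul(x, w) - x @ w).max()
print(f"max abs error vs fp matmul: {err:.4f}")
```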
- Algorithm to Compilation Co-design: An Integrated View of Neural Network Sparsity [0.8566457170664925]
We apply structured and unstructured pruning to attention weights of transformer blocks of the BERT language model.
We study relationships between modeling decisions and their direct impact on sparsity-enhanced execution.
arXiv Detail & Related papers (2021-06-16T15:13:26Z)
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single precision and up to 452x using half precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
- A Tensor Compiler for Unified Machine Learning Prediction Serving [8.362773007171118]
Machine Learning (ML) adoption in the enterprise requires simpler and more efficient software infrastructure.
Model scoring is a primary contributor to infrastructure complexity and cost as models are trained once but used many times.
We propose HUMMINGBIRD, a novel approach to model scoring that compiles featurization operators and traditional ML models into a small set of tensor operations.
arXiv Detail & Related papers (2020-10-09T21:02:47Z)
- Straggler-aware Distributed Learning: Communication Computation Latency Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as either stragglers or non-stragglers.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)