Enabling On-Device Smartphone GPU based Training: Lessons Learned
- URL: http://arxiv.org/abs/2202.10100v1
- Date: Mon, 21 Feb 2022 10:29:16 GMT
- Title: Enabling On-Device Smartphone GPU based Training: Lessons Learned
- Authors: Anish Das and Young D. Kwon and Jagmohan Chauhan and Cecilia Mascolo
- Abstract summary: We conduct an initial analysis to examine the feasibility of on-device training on smartphones using mobile GPUs.
To solve the computation bottleneck, we optimize the OpenCL backend's kernels, showing 2x improvements (40-70 GFLOPs) over CPUs (15-30 GFLOPs).
The data movement takes almost 91% of training time due to the low bandwidth.
- Score: 10.420617367363047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Learning (DL) has shown impressive performance in many mobile
applications. Most existing works have focused on reducing the computational
and resource overheads of running Deep Neural Networks (DNN) inference on
resource-constrained mobile devices. However, the other aspect of DNN
operations, i.e. training (forward and backward passes) on smartphone GPUs, has
received little attention thus far. To this end, we conduct an initial analysis
to examine the feasibility of on-device training on smartphones using mobile
GPUs. We first employ the open-source mobile DL framework (MNN) and its OpenCL
backend for running compute kernels on GPUs. Next, we observe that training on
CPUs is much faster than on GPUs and identify two possible bottlenecks
related to this observation: (i) computation and (ii) memory bottlenecks. To
solve the computation bottleneck, we optimize the OpenCL backend's kernels,
showing 2x improvements (40-70 GFLOPs) over CPUs (15-30 GFLOPs) on the
Snapdragon 8 series processors. However, we find that the full DNN training is
still much slower on GPUs than on CPUs, indicating that the memory bottleneck plays
a significant role in the GPU's lower performance relative to the CPU. The data movement
takes almost 91% of training time due to the low bandwidth. Lastly, based on
the findings and failures during our investigation, we present limitations and
practical guidelines for future directions.
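As an illustration of how such a data-movement share can be measured, the hedged sketch below uses pyopencl event profiling, with a toy `scale` kernel standing in for a training kernel; the paper itself instruments MNN's OpenCL backend on Android, so every name and size here is illustrative, not the authors' setup.

```python
# Hedged sketch: split one "step" into H2D copy, kernel, and D2H copy, and
# report the share of time spent moving data. Requires pyopencl + numpy.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

n = 1 << 22                                   # ~16 MB of float32 (illustrative)
host_in = np.random.rand(n).astype(np.float32)
host_out = np.empty_like(host_in)

mf = cl.mem_flags
dev_in = cl.Buffer(ctx, mf.READ_ONLY, host_in.nbytes)
dev_out = cl.Buffer(ctx, mf.WRITE_ONLY, host_out.nbytes)

# Trivial stand-in for a real training kernel.
prg = cl.Program(ctx, """
__kernel void scale(__global const float *x, __global float *y) {
    int i = get_global_id(0);
    y[i] = 2.0f * x[i];
}
""").build()

def elapsed_ns(evt):
    evt.wait()
    return evt.profile.end - evt.profile.start

h2d = elapsed_ns(cl.enqueue_copy(queue, dev_in, host_in))
kern = elapsed_ns(prg.scale(queue, (n,), None, dev_in, dev_out))
d2h = elapsed_ns(cl.enqueue_copy(queue, host_out, dev_out))

moved, total = h2d + d2h, h2d + kern + d2h
print(f"data movement: {100.0 * moved / total:.1f}% of step time")
```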
Related papers
- Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference [6.829272097221596]
We show that a CPU-only configuration achieves 17 tokens per second, surpassing the 12.8 tokens per second obtained with GPU acceleration. We analyze the factors driving this counterintuitive result, revealing that GPU memory transfer overhead and CPU thread optimization play a critical role. Our findings challenge conventional GPU-first thinking, highlighting the untapped potential of optimized CPU inference.
arXiv Detail & Related papers (2025-05-09T23:05:53Z)
- Less Memory Means smaller GPUs: Backpropagation with Compressed Activations [1.7065506903618906]
The ever-growing scale of deep neural networks (DNNs) has led to an equally rapid growth in computational resource requirements.
Many recent architectures, most prominently Large Language Models, have to be trained using supercomputers with thousands of accelerators.
With this approach we are able to reduce the peak memory consumption by 29% at the cost of a longer training schedule.
arXiv Detail & Related papers (2024-09-18T11:57:05Z)
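A hedged sketch of the general idea behind the compressed-activations entry above (not the paper's actual scheme): a custom autograd function that saves an 8-bit quantized activation for the backward pass instead of the full float32 tensor.

```python
# Hedged sketch (not the paper's actual scheme): save an 8-bit quantized
# copy of an activation for backward instead of the full float32 tensor.
import torch

class CompressedSigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = torch.sigmoid(x)
        # sigmoid outputs lie in [0, 1], so uint8 quantization is lossy but
        # well-scaled; this shrinks the saved activation by 4x.
        ctx.save_for_backward((y * 255.0).round().to(torch.uint8))
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (q,) = ctx.saved_tensors
        y = q.to(grad_out.dtype) / 255.0      # decompress on demand
        return grad_out * y * (1.0 - y)       # d(sigmoid)/dx = y * (1 - y)
```

Calling `CompressedSigmoid.apply(x)` in place of `torch.sigmoid(x)` trades some gradient precision and extra dequantization work for lower peak memory, the same trade-off the entry quantifies as a 29% peak-memory reduction at the cost of a longer training schedule.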
- Comparative Analysis of CPU and GPU Profiling for Deep Learning Models [0.0]
This paper presents the time and memory allocation of CPU and GPU while training deep neural networks using PyTorch.
For a simpler network, there are not many significant improvements in GPU over the CPU.
arXiv Detail & Related papers (2023-09-05T18:22:11Z)
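For readers who want to reproduce this kind of CPU-versus-GPU comparison, a minimal sketch using PyTorch's built-in profiler follows; the model, batch size, and sort key are placeholders, not the paper's exact setup.

```python
# Hedged sketch: capture per-op time and memory for one training step on
# CPU (and GPU, if available) with torch.profiler.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
x, target = torch.randn(64, 512), torch.randint(0, 10, (64,))
loss_fn = nn.CrossEntropyLoss()

acts = [ProfilerActivity.CPU]
if torch.cuda.is_available():                 # add a GPU pass only if present
    model, x, target = model.cuda(), x.cuda(), target.cuda()
    acts.append(ProfilerActivity.CUDA)

with profile(activities=acts, profile_memory=True) as prof:
    loss_fn(model(x), target).backward()      # forward + backward pass

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```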
- Deep Learning Models on CPUs: A Methodology for Efficient Training [1.7150798380270715]
This paper makes several contributions to research on training deep learning models using CPUs.
It presents a method for optimizing the training of deep learning models on Intel CPUs and a toolkit called ProfileDNN.
arXiv Detail & Related papers (2022-06-20T22:42:14Z)
- Dynamic Split Computing for Efficient Deep Edge Intelligence [78.4233915447056]
We introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel.
We show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time.
arXiv Detail & Related papers (2022-05-23T12:35:18Z)
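A hypothetical sketch of the selection rule behind the dynamic split computing entry above, under the simple assumption that per-layer device times, server times, and activation sizes have been profiled ahead of time; all names and numbers are illustrative.

```python
# Hypothetical sketch: choose the layer after which to ship activations so
# that (on-device time + transfer time + server time) is minimized for the
# currently measured data rate.
def best_split(device_ms, server_ms, act_bytes, rate_bps):
    """device_ms[i]: cumulative on-device time through layer i;
    server_ms[i]: server time for the remaining layers;
    act_bytes[i]: activation size sent if we split after layer i."""
    def latency(i):
        transfer_ms = 8e3 * act_bytes[i] / rate_bps   # bytes -> bits -> ms
        return device_ms[i] + transfer_ms + server_ms[i]
    return min(range(len(act_bytes)), key=latency)

# Re-evaluated whenever the measured channel rate changes:
split = best_split([2, 5, 9], [9, 5, 2], [4e5, 1e5, 2e4], rate_bps=20e6)
```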
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
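As a plain reference for what the optimized layer in the entry above computes (not the AVX-512 implementation itself), a 1D dilated convolution simply spaces its kernel taps d samples apart:

```python
# Minimal reference semantics for a 1D dilated convolution (valid padding).
import numpy as np

def dilated_conv1d(x, w, d):
    k = len(w)
    span = (k - 1) * d + 1                    # receptive field of the kernel
    out = np.empty(len(x) - span + 1, dtype=x.dtype)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * d] for j in range(k))
    return out

x = np.arange(10, dtype=np.float64)
print(dilated_conv1d(x, np.array([1.0, -1.0]), d=2))  # each entry: x[i] - x[i+2]
```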
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
- L2PF -- Learning to Prune Faster [57.32153461504626]
We present a multi-task, try-and-learn method that discretely learns which CNN filters are redundant and a continuous action for how long each layer has to be fine-tuned.
For ResNet20, we have achieved a compression ratio of 3.84x with minimal accuracy degradation.
Compared to the state-of-the-art pruning method, we reduced the GPU hours by 1.71x.
arXiv Detail & Related papers (2021-01-07T18:13:37Z)
- At-Scale Sparse Deep Neural Network Inference with Efficient GPU Implementation [24.824295164938604]
This paper presents GPU performance optimization and scaling results for inference models of the Sparse Deep Neural Network Challenge 2020.
Sparse deep neural networks (SpDNN) have shown promise for reining in the memory footprint of large neural networks.
This work presents optimized sparse matrix multiplication kernels fused with the ReLU function.
arXiv Detail & Related papers (2020-07-28T12:09:43Z)
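A hedged NumPy sketch of the fusion pattern described in the entry above, with the ReLU applied inside the CSR sparse-times-dense loop so the output is never re-read in a separate activation pass; the actual work targets optimized CUDA kernels, so this only shows the semantics.

```python
# Hedged sketch: y = relu(A @ dense) with A in CSR form, ReLU fused into
# the row loop instead of run as a second pass over memory.
import numpy as np

def csr_spmm_relu(indptr, indices, data, dense):
    rows = len(indptr) - 1
    out = np.zeros((rows, dense.shape[1]), dtype=dense.dtype)
    for r in range(rows):
        for p in range(indptr[r], indptr[r + 1]):
            out[r] += data[p] * dense[indices[p]]
        np.maximum(out[r], 0.0, out=out[r])   # fused ReLU, no extra pass
    return out
```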
- RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices [57.877112704841366]
This paper proposes RT3D, a model compression and mobile acceleration framework for 3D CNNs.
For the first time, real-time execution of 3D CNNs is achieved on off-the-shelf mobiles.
arXiv Detail & Related papers (2020-07-20T02:05:32Z)
- Hybrid Models for Learning to Branch [81.93868699246214]
We propose a new hybrid architecture for efficient branching on CPU machines.
The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLP) for branching.
arXiv Detail & Related papers (2020-06-26T21:03:45Z)
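A hypothetical PyTorch sketch of the hybrid pattern in the entry above: pay for one expressive GNN pass up front, then reuse its embeddings with a cheap MLP for every subsequent branching decision. Module names and dimensions are illustrative, not the paper's code.

```python
# Hypothetical sketch: run the expensive GNN once (e.g., at the root of the
# branch-and-bound tree), then score candidates at later nodes with a cheap
# MLP that is fast on CPU.
import torch
import torch.nn as nn

class HybridBrancher(nn.Module):
    def __init__(self, gnn, emb_dim, node_feat_dim, hidden=64):
        super().__init__()
        self.gnn = gnn                        # expressive but slow
        self.mlp = nn.Sequential(             # cheap per-node scorer
            nn.Linear(emb_dim + node_feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def embed_root(self, graph):
        # One-time GNN pass; embeddings are cached for the whole search.
        self.emb = self.gnn(graph)            # [num_vars, emb_dim]

    def score(self, node_feats):              # [num_vars, node_feat_dim]
        z = torch.cat([self.emb, node_feats], dim=-1)
        return self.mlp(z).squeeze(-1)        # branching score per variable
```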
- TFApprox: Towards a Fast Emulation of DNN Approximate Hardware Accelerators on GPU [0.4817429789586127]
Energy efficiency of hardware accelerators of deep neural networks (DNN) can be improved by introducing approximate arithmetic circuits.
A software emulation of the DNN accelerator is usually executed on CPU or GPU.
This emulation is typically two or three orders of magnitude slower than a software DNN implementation running natively on a CPU or GPU.
arXiv Detail & Related papers (2020-02-21T08:22:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.