NeuroFlux: Memory-Efficient CNN Training Using Adaptive Local Learning
- URL: http://arxiv.org/abs/2402.14139v2
- Date: Mon, 4 Mar 2024 17:47:32 GMT
- Title: NeuroFlux: Memory-Efficient CNN Training Using Adaptive Local Learning
- Authors: Dhananjay Saikumar and Blesson Varghese
- Abstract summary: Convolutional Neural Network (CNN) training in resource-constrained mobile and edge environments is an open challenge.
Backpropagation is the standard approach adopted, but it is GPU memory intensive due to its strong inter-layer dependencies.
We introduce NeuroFlux, a novel CNN training system tailored for memory-constrained scenarios.
- Score: 2.61072980439312
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient on-device Convolutional Neural Network (CNN) training in
resource-constrained mobile and edge environments is an open challenge.
Backpropagation is the standard approach adopted, but it is GPU memory
intensive due to its strong inter-layer dependencies that demand intermediate
activations across the entire CNN model to be retained in GPU memory. This
necessitates smaller batch sizes to make training possible within the available
GPU memory budget, but in turn, results in substantially high and impractical
training time. We introduce NeuroFlux, a novel CNN training system tailored for
memory-constrained scenarios. We exploit two novel opportunities: firstly,
adaptive auxiliary networks that employ a variable number of filters to reduce
GPU memory usage, and secondly, block-specific adaptive batch sizes, which not
only cater to the GPU memory constraints but also accelerate the training
process. NeuroFlux segments a CNN into blocks based on GPU memory usage and
further attaches an auxiliary network to each layer in these blocks. This
disrupts the typical layer dependencies under a new training paradigm -
$\textit{`adaptive local learning'}$. Moreover, NeuroFlux adeptly caches
intermediate activations, eliminating redundant forward passes over previously
trained blocks, further accelerating the training process. The results are
twofold when compared to Backpropagation: on various hardware platforms,
NeuroFlux demonstrates training speed-ups of 2.3$\times$ to 6.1$\times$ under
stringent GPU memory budgets, and NeuroFlux generates streamlined models that
have 10.9$\times$ to 29.4$\times$ fewer parameters.
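The mechanics of adaptive local learning can be illustrated with a short sketch. The paper's implementation is not reproduced here; the following PyTorch code is a minimal, assumption-laden illustration of the idea described above: the CNN is split into blocks, each block gets an auxiliary head (here simply global average pooling plus a linear classifier), each block is trained locally with its own batch size, and the block's output activations are cached so earlier blocks are never re-run. The block boundaries, auxiliary-head design, and batch sizes below are illustrative choices, not NeuroFlux's memory-driven ones.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative assumptions: a tiny 3-block CNN, one auxiliary head per block,
# and hand-picked per-block batch sizes. NeuroFlux derives the blocks, the
# auxiliary filter counts, and the batch sizes from the GPU memory budget.
blocks = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
])
aux_heads = nn.ModuleList([  # auxiliary network: pooling + linear classifier
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c, 10))
    for c in (32, 64, 128)
])
block_batch_sizes = [256, 128, 64]  # block-specific batch sizes (assumed)

def train_block(k, inputs, labels, epochs=1):
    """Train block k and its auxiliary head locally, then return the block's
    output activations so later blocks never re-run earlier ones."""
    params = list(blocks[k].parameters()) + list(aux_heads[k].parameters())
    opt = torch.optim.SGD(params, lr=0.05, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(TensorDataset(inputs, labels),
                        batch_size=block_batch_sizes[k], shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            feats = blocks[k](x)                   # local forward pass only
            loss = loss_fn(aux_heads[k](feats), y)
            loss.backward()                        # gradients stay inside block k
            opt.step()
    # cache this block's activations for the next block (no autograd graph kept)
    with torch.no_grad():
        cache_loader = DataLoader(TensorDataset(inputs, labels),
                                  batch_size=block_batch_sizes[k])
        return torch.cat([blocks[k](x) for x, _ in cache_loader])

# Toy data standing in for a real dataset.
images = torch.randn(512, 3, 32, 32)
labels = torch.randint(0, 10, (512,))
acts = images
for k in range(len(blocks)):
    acts = train_block(k, acts, labels)
```

At inference time the blocks are chained and the final auxiliary head acts as the classifier in this sketch; NeuroFlux itself additionally varies the number of auxiliary filters and the per-block batch size according to the measured GPU memory budget.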
Related papers
- Distributed Convolutional Neural Network Training on Mobile and Edge Clusters [0.9421843976231371]
Recent efforts have emerged to localize machine learning tasks fully on the edge.
This brings advantages in reduced latency and increased privacy, but necessitates working with resource-constrained devices.
We describe an approach for distributed CNN training exclusively on mobile and edge devices.
arXiv Detail & Related papers (2024-09-11T02:44:28Z)
- Communication-Free Distributed GNN Training with Vertex Cut [63.22674903170953]
CoFree-GNN is a novel distributed GNN training framework that significantly speeds up the training process by implementing communication-free training.
We demonstrate that CoFree-GNN speeds up the GNN training process by up to 10 times over the existing state-of-the-art GNN training approaches.
arXiv Detail & Related papers (2023-08-06T21:04:58Z)
- DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training [18.52206409432894]
DistTGL is an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters.
In experiments, DistTGL achieves near-linear convergence speedup, outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17x in training throughput.
arXiv Detail & Related papers (2023-07-14T22:52:27Z)
- GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction [50.248694764703714]
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction.
These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network based regularization.
We propose Greedy LEarning for Accelerated MRI reconstruction, an efficient training strategy for high-dimensional imaging settings.
arXiv Detail & Related papers (2022-07-18T06:01:29Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
arXiv Detail & Related papers (2022-01-16T07:22:47Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory (a minimal sketch of the patch-based idea appears after this list).
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- perf4sight: A toolflow to model CNN training performance on Edge GPUs [16.61258138725983]
This work proposes perf4sight, an automated methodology for developing accurate models that predict CNN training memory footprint and latency.
With PyTorch as the framework and NVIDIA Jetson TX2 as the target device, the developed models predict training memory footprint and latency with 95% and 91% accuracy respectively.
arXiv Detail & Related papers (2021-08-12T07:55:37Z)
- When deep learning models on GPU can be accelerated by taking advantage of unstructured sparsity [0.0]
This paper focuses on improving the efficiency of sparse convolutional neural network (CNN) layers on graphics processing units (GPUs).
Modern CNN models need megabytes of coefficients and millions of MAC operations to perform convolution.
We show when it is worth using a direct sparse operation to speed up the computation of convolution layers (see the sparse-convolution sketch after this list).
arXiv Detail & Related papers (2020-11-12T10:13:48Z)
- Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning [56.83172249278467]
We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces.
We train and validate our approach directly on the Intel NNP-I chip for inference.
We additionally achieve 28-78% speed-up compared to the native NNP-I compiler on all three workloads.
arXiv Detail & Related papers (2020-07-14T18:50:12Z)
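The MCUNetV2 entry above refers to patch-by-patch inference scheduling. As a rough illustration of why running the memory-heavy early stage tile by tile lowers peak activation memory, here is a minimal PyTorch sketch; the two-convolution stage, tile size, and single-pad halo handling are assumptions for exposition, not MCUNetV2's actual schedule or kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# An assumed early stage: two 3x3, stride-1 convolutions with no padding.
# The full-image reference computation is body(F.pad(x, (2, 2, 2, 2))).
body = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.ReLU(),
    nn.Conv2d(16, 16, 3), nn.ReLU(),
)
HALO = 2  # one pixel of context per 3x3 convolution

def patch_inference(x, patch=32):
    """Evaluate `body` tile by tile; only one tile's activations are alive at
    a time, so peak activation memory scales with the tile, not the image."""
    xp = F.pad(x, (HALO, HALO, HALO, HALO))  # pad once so border tiles have halos
    _, _, H, W = x.shape
    rows = []
    for top in range(0, H, patch):
        cols = []
        for left in range(0, W, patch):
            h, w = min(patch, H - top), min(patch, W - left)
            tile = xp[:, :, top:top + h + 2 * HALO, left:left + w + 2 * HALO]
            cols.append(body(tile))  # valid convs shrink the tile back to h x w
        rows.append(torch.cat(cols, dim=-1))
    return torch.cat(rows, dim=-2)

x = torch.randn(1, 3, 128, 128)
with torch.no_grad():
    assert torch.allclose(patch_inference(x), body(F.pad(x, (2, 2, 2, 2))), atol=1e-5)
```

This captures only the scheduling idea; MCUNetV2 additionally uses neural architecture search to decide where patch-based execution stops and redistributes receptive fields to keep the halo overlap small.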
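The unstructured-sparsity entry asks when a direct sparse operation beats a dense convolution on GPU. A minimal way to explore that question in PyTorch is to express the convolution as a sparse weight matrix multiplied by an im2col matrix and time it against `F.conv2d`; the magnitude-threshold pruning and COO layout below are illustrative choices, not the paper's kernels.

```python
import torch
import torch.nn.functional as F

def sparse_conv2d(x, weight, bias=None, threshold=0.0):
    """Stride-1 'same' convolution computed as (sparse weight) @ (im2col matrix).
    Only expected to pay off when the vast majority of weights are zero."""
    out_ch, in_ch, kh, kw = weight.shape
    n, _, h, w = x.shape
    # im2col: (N, in_ch*kh*kw, H*W)
    cols = F.unfold(x, kernel_size=(kh, kw), padding=(kh // 2, kw // 2))
    w2d = weight.reshape(out_ch, -1)
    w2d = torch.where(w2d.abs() > threshold, w2d, torch.zeros_like(w2d))
    w_sparse = w2d.to_sparse()  # COO sparse weight matrix
    out = torch.stack([torch.sparse.mm(w_sparse, cols[i]) for i in range(n)])
    out = out.reshape(n, out_ch, h, w)
    if bias is not None:
        out = out + bias.view(1, -1, 1, 1)
    return out

# Sanity check against the dense reference (zero threshold keeps all weights).
x = torch.randn(2, 8, 32, 32)
weight = torch.randn(16, 8, 3, 3)
dense = F.conv2d(x, weight, padding=1)
assert torch.allclose(sparse_conv2d(x, weight), dense, atol=1e-4)
```

Timing both paths at increasing sparsity (for example with `torch.utils.benchmark`) exposes the crossover the paper is about: the sparse route only tends to win once the weight matrix is overwhelmingly zero.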