Out-of-core Training for Extremely Large-Scale Neural Networks With
Adaptive Window-Based Scheduling
- URL: http://arxiv.org/abs/2010.14109v1
- Date: Tue, 27 Oct 2020 07:40:04 GMT
- Title: Out-of-core Training for Extremely Large-Scale Neural Networks With
Adaptive Window-Based Scheduling
- Authors: Akio Hayakawa, Takuya Narihira
- Abstract summary: We propose a novel out-of-core algorithm that enables faster training of extremely large-scale neural networks with sizes larger than allotted GPU memory.
We apply a virtual addressing technique, commonly used in operating systems, to the training of neural networks with out-of-core execution.
We successfully train ResNet-50 with a batch size of 1440, 7.5x larger than the upper bound imposed by physical memory, while maintaining 55% of the training speed.
- Score: 4.903820815918411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large neural networks demonstrate higher performance in
various tasks, training large networks is difficult due to limitations on GPU
memory size. We propose a novel out-of-core algorithm that enables faster
training of extremely large-scale neural networks with sizes larger than the
allotted GPU memory. Under a given memory budget constraint, our scheduling
algorithm locally adapts the timing of memory transfers according to the
memory usage of each function, which improves overlap between computation and
memory transfers. Additionally, we apply a virtual addressing technique,
commonly used in operating systems, to the training of neural networks with
out-of-core execution, which drastically reduces the memory fragmentation
caused by frequent memory transfers. With the proposed algorithm, we
successfully train ResNet-50 with a batch size of 1440, 7.5x larger than the
upper bound imposed by physical memory, while maintaining 55% of the training
speed. The algorithm also substantially outperforms a previous
state-of-the-art method, training a 1.55x larger network with faster
execution. Moreover, we experimentally show that our approach scales to
various types of networks.
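
The abstract can be made concrete with a small sketch. The Python below is a hypothetical illustration of window-based out-of-core execution, not the authors' implementation: Layer, prefetch, and offload are made-up stand-ins for parameter/activation buffers and asynchronous host-device copies, and the window here is bounded only by a fixed memory budget rather than adapted per function as the paper describes.

# Minimal sketch of window-based out-of-core execution (illustrative only).
# Assumptions: each "layer" owns a buffer that can live on the host or on the
# device; prefetch()/offload() stand in for asynchronous host<->device copies
# that would overlap with computation on a real GPU.

from collections import deque

class Layer:
    def __init__(self, name, mbytes):
        self.name, self.mbytes, self.on_device = name, mbytes, False

def prefetch(layer):   # stand-in for an async host-to-device copy
    layer.on_device = True

def offload(layer):    # stand-in for an async device-to-host copy
    layer.on_device = False

def run_out_of_core(layers, budget_mb):
    """Keep a sliding window of layers resident on the device.

    The window grows until the memory budget is reached; the real algorithm
    additionally adapts the transfer timing to each function's memory usage.
    """
    resident, used = deque(), 0
    for i, layer in enumerate(layers):
        # Prefetch upcoming layers while the budget allows (the "window").
        j = i
        while j < len(layers):
            nxt = layers[j]
            if not nxt.on_device:
                if used + nxt.mbytes > budget_mb:
                    break
                prefetch(nxt)
                resident.append(nxt)
                used += nxt.mbytes
            j += 1
        # Execute the current layer (placeholder for its forward/backward pass).
        print(f"compute {layer.name:7s} resident={[l.name for l in resident]}")
        # Evict the finished layer so future prefetches have room.
        if resident and resident[0] is layer:
            resident.popleft()
            offload(layer)
            used -= layer.mbytes

layers = [Layer(f"layer{i}", mbytes=40 * (1 + i % 3)) for i in range(6)]
run_out_of_core(layers, budget_mb=150)
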
Related papers
- OLLA: Decreasing the Memory Usage of Neural Networks by Optimizing the
Lifetime and Location of Arrays [6.418232942455968]
OLLA is an algorithm that optimizes the lifetime and memory location of the tensors used to train neural networks.
We present several techniques to simplify the encoding of the problem, and enable our approach to scale to the size of state-of-the-art neural networks.
arXiv Detail & Related papers (2022-10-24T02:39:13Z)
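
As a rough illustration of what optimizing tensor lifetimes and locations involves (OLLA itself formulates this as a joint optimization problem; the greedy first-fit rule below is only a hypothetical stand-in), the sketch assigns byte offsets so that tensors with overlapping lifetimes never share addresses:

# Toy greedy address assignment for tensors with known lifetimes
# (illustrative only; not OLLA's actual solver).

def assign_offsets(tensors):
    """tensors: list of (name, start_step, end_step, size_bytes)."""
    placed = []   # (offset, size, start, end)
    plan = {}
    for name, start, end, size in sorted(tensors, key=lambda t: t[3], reverse=True):
        # Address ranges already occupied by tensors whose lifetimes overlap.
        busy = sorted((off, off + sz) for off, sz, s, e in placed
                      if not (e < start or end < s))
        offset = 0
        for lo, hi in busy:            # first-fit: lowest gap large enough
            if offset + size <= lo:
                break
            offset = max(offset, hi)
        placed.append((offset, size, start, end))
        plan[name] = offset
    peak = max((off + sz for off, sz, _, _ in placed), default=0)
    return plan, peak

tensors = [("act0", 0, 2, 512), ("act1", 1, 3, 256),
           ("act2", 2, 4, 512), ("grad0", 3, 5, 512)]
plan, peak = assign_offsets(tensors)
print(plan, "peak bytes:", peak)   # grad0 reuses act0's space once act0 is dead
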
- GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction [50.248694764703714]
Unrolled neural networks have recently achieved state-of-the-art performance in accelerated MRI reconstruction.
These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network based regularization.
We propose Greedy LEarning for Accelerated MRI reconstruction, an efficient training strategy for high-dimensional imaging settings.
arXiv Detail & Related papers (2022-07-18T06:01:29Z)
- Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
arXiv Detail & Related papers (2022-01-16T07:22:47Z)
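
A minimal sketch of the multiresolution hash-table idea behind Instant Neural Graphics Primitives, under simplifying assumptions (nearest-vertex lookup instead of interpolation, made-up level count and table size, NumPy instead of the paper's fused CUDA kernels):

import numpy as np

# Toy multiresolution hash encoding (nearest-vertex lookup, no interpolation).
class HashEncoding:
    def __init__(self, levels=4, table_size=2**14, features=2, base_res=16):
        self.resolutions = [base_res * 2**l for l in range(levels)]
        self.table_size = table_size
        # One small trainable table of feature vectors per resolution level.
        self.tables = [1e-4 * np.random.randn(table_size, features)
                       for _ in self.resolutions]

    def _hash(self, cell):
        # Spatial hash: xor of coordinate * large prime, folded into the table.
        primes = (1, 2654435761, 805459861)
        h = 0
        for c, p in zip(cell, primes):
            h ^= int(c) * p
        return h % self.table_size

    def encode(self, xyz):
        """xyz in [0, 1]^3 -> concatenated per-level feature vectors."""
        feats = []
        for res, table in zip(self.resolutions, self.tables):
            cell = np.floor(np.asarray(xyz) * res).astype(int)
            feats.append(table[self._hash(cell)])
        return np.concatenate(feats)   # fed to a small MLP in the real method

enc = HashEncoding()
print(enc.encode([0.25, 0.5, 0.75]).shape)   # (8,) == levels * features
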
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
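
The patch-by-patch idea can be sketched with a toy operator. The example below is not MCUNetV2's implementation; it applies a 3x3 mean filter one tile (plus a one-pixel halo) at a time, so only a small tile is processed at once, which is the mechanism that cuts peak activation memory in the memory-heavy early stage of a CNN:

import numpy as np

def mean3x3_full(x):
    """Reference: 3x3 mean filter over the whole (zero-padded) image."""
    p = np.pad(x, 1)
    return sum(p[i:i + x.shape[0], j:j + x.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def mean3x3_patchwise(x, patch=8):
    """Same output, computed one tile (plus halo) at a time so the working set
    is bounded by the tile size rather than the full feature map.
    (Padding the whole input here is only for brevity of the sketch.)"""
    p = np.pad(x, 1)                       # 1-pixel halo around the image
    out = np.empty_like(x, dtype=float)
    h, w = x.shape
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            ph, pw = min(patch, h - i), min(patch, w - j)
            tile = p[i:i + ph + 2, j:j + pw + 2]          # patch + halo
            out[i:i + ph, j:j + pw] = sum(
                tile[a:a + ph, b:b + pw] for a in range(3) for b in range(3)
            ) / 9.0
    return out

x = np.random.rand(32, 32)
print(np.allclose(mean3x3_full(x), mean3x3_patchwise(x)))   # True
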
- Towards Memory-Efficient Neural Networks via Multi-Level in situ Generation [10.563649948220371]
Deep neural networks (DNNs) have shown superior performance in a variety of tasks.
As they rapidly evolve, their escalating computation and memory demands make it challenging to deploy them on resource-constrained edge devices.
We propose a general and unified framework to trade expensive memory transactions with ultra-fast on-chip computations.
arXiv Detail & Related papers (2021-08-25T18:50:24Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural
Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
- Binary Graph Neural Networks [69.51765073772226]
Graph Neural Networks (GNNs) have emerged as a powerful and flexible framework for representation learning on irregular data.
In this paper, we present and evaluate different strategies for the binarization of graph neural networks.
We show that through careful design of the models, and control of the training process, binary graph neural networks can be trained at only a moderate cost in accuracy on challenging benchmarks.
arXiv Detail & Related papers (2020-12-31T18:48:58Z)
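
As a generic illustration of binarization (not necessarily the exact strategy evaluated in the Binary Graph Neural Networks paper), the sketch below replaces a real-valued weight matrix with its sign scaled by the mean absolute value inside a simple message-passing layer:

import numpy as np

def binarize(w):
    """XNOR-Net-style binarization: sign(W) scaled by the mean |W|.
    (Illustrative; the paper studies several binarization strategies.)"""
    return np.abs(w).mean() * np.sign(w)

def binary_gnn_layer(adj_norm, h, w):
    """One message-passing layer with a binarized weight: relu(A_norm @ H @ Wb)."""
    return np.maximum(adj_norm @ h @ binarize(w), 0.0)

# Tiny example: 4 nodes, 8-d input features, 16-d output features.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
a_hat = adj + np.eye(4)                          # add self-loops
adj_norm = a_hat / a_hat.sum(1, keepdims=True)   # row-normalize
h = np.random.randn(4, 8)
w = np.random.randn(8, 16) * 0.1
print(binary_gnn_layer(adj_norm, h, w).shape)    # (4, 16)
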
- Robust High-dimensional Memory-augmented Neural Networks [13.82206983716435]
Memory-augmented neural networks enhance neural networks with an explicit memory to overcome the limitations of storing information solely in network parameters.
Access to this explicit memory occurs via soft read and write operations involving every individual memory entry.
We propose a robust architecture that employs a computational memory unit as the explicit memory performing analog in-memory computation on high-dimensional (HD) vectors.
arXiv Detail & Related papers (2020-10-05T12:01:56Z)
- Improving Memory Utilization in Convolutional Neural Network
Accelerators [16.340620299847384]
We propose a mapping method that allows activation layers to overlap and thus utilize the memory more efficiently.
Experiments with various real-world object detector networks show that the proposed mapping technique can decrease the activations memory by up to 32.9%.
For higher resolution de-noising networks, we achieve activation memory savings of 48.8%.
arXiv Detail & Related papers (2020-07-20T09:34:36Z)
- Optimizing Memory Placement using Evolutionary Graph Reinforcement
Learning [56.83172249278467]
We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces.
We train and validate our approach directly on the Intel NNP-I chip for inference.
We additionally achieve 28-78% speed-up compared to the native NNP-I compiler on all three workloads.
arXiv Detail & Related papers (2020-07-14T18:50:12Z)
- Ordering Chaos: Memory-Aware Scheduling of Irregularly Wired Neural
Networks for Edge Devices [10.876317610988059]
We present a memory-aware compiler, dubbed SERENITY, that finds a schedule with an optimal memory footprint.
Our solution also comprises a graph rewriting technique that allows further reduction beyond this optimum.
arXiv Detail & Related papers (2020-03-04T23:38:54Z)
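
SERENITY searches for a schedule with an optimal peak footprint; the hypothetical sketch below shows only the simpler greedy flavor of memory-aware scheduling: among the ready nodes of a dataflow graph, always run the one that keeps live activation memory lowest. Node names and sizes are invented for illustration.

# Toy memory-aware scheduling of a small dataflow graph (greedy heuristic,
# not SERENITY's exact search).

def schedule(graph, sizes):
    """graph: node -> list of consumers; sizes: node -> output size in bytes.
    Greedily runs, among the ready nodes, the one keeping live memory lowest."""
    preds = {n: set() for n in graph}
    for n, consumers in graph.items():
        for c in consumers:
            preds[c].add(n)
    remaining_uses = {n: len(graph[n]) for n in graph}
    done, live, order, peak = set(), {}, [], 0

    def mem_after(n):  # live bytes if we ran node n next and freed dead inputs
        freed = sum(sizes[p] for p in preds[n]
                    if remaining_uses[p] == 1 and p in live)
        return sum(live.values()) + sizes[n] - freed

    while len(order) < len(graph):
        ready = [n for n in graph if n not in done and preds[n] <= done]
        n = min(ready, key=mem_after)
        order.append(n)
        done.add(n)
        live[n] = sizes[n]
        peak = max(peak, sum(live.values()))     # inputs + fresh output live
        for p in preds[n]:                       # free inputs with no uses left
            remaining_uses[p] -= 1
            if remaining_uses[p] == 0:
                live.pop(p, None)
    return order, peak

graph = {"input": ["conv1", "conv2"], "conv1": ["add"],
         "conv2": ["add"], "add": ["out"], "out": []}
sizes = {"input": 64, "conv1": 128, "conv2": 32, "add": 128, "out": 128}
print(schedule(graph, sizes))   # e.g. (['input', 'conv2', 'conv1', 'add', 'out'], 288)
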
This list is automatically generated from the titles and abstracts of the papers on this site.