XEngine: Optimal Tensor Rematerialization for Neural Networks in
Heterogeneous Environments
- URL: http://arxiv.org/abs/2212.09290v1
- Date: Mon, 19 Dec 2022 08:12:25 GMT
- Title: XEngine: Optimal Tensor Rematerialization for Neural Networks in
Heterogeneous Environments
- Authors: Manuela Schuler, Richard Membarth, Philipp Slusallek
- Abstract summary: We present XEngine, an approach that schedules network operators to heterogeneous devices in low memory environments.
Our solver finds solutions up to 22.5 % faster than the fastest Checkmate schedule in which the network is computed exclusively on a single device.
- Score: 3.769144330511514
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Memory efficiency is crucial in training deep learning networks on
resource-restricted devices. During backpropagation, forward tensors are used
to calculate gradients. Despite the option of keeping those dependencies in
memory until they are reused in backpropagation, some forward tensors can be
discarded and recomputed later from saved tensors, so-called checkpoints. This
allows, in particular, for resource-constrained heterogeneous environments to
make use of all available compute devices. Unfortunately, the definition of
these checkpoints is a non-trivial problem and poses a challenge to the
programmer: improper or excessive recomputation negates the benefit of
checkpointing.
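To make the store-versus-recompute trade-off concrete, here is a minimal sketch using PyTorch's built-in activation checkpointing. It only illustrates the mechanism on two hypothetical blocks with arbitrary sizes; it is not XEngine's scheduler.
```python
# Minimal sketch of the store-vs-recompute trade-off using PyTorch's
# activation checkpointing (illustrative only; sizes are arbitrary).
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
block2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
x = torch.randn(64, 1024, requires_grad=True)

# block1's intermediate forward tensors are NOT kept in memory; during
# backpropagation they are recomputed from the saved checkpoint (its input).
h = checkpoint(block1, x, use_reentrant=False)

# block2's forward tensors are stored as usual and reused directly.
loss = block2(h).sum()
loss.backward()
```
Choosing which blocks to checkpoint by hand scales poorly; the approach below makes this decision, together with the device assignment, via an optimization problem.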
In this article, we present XEngine, an approach that schedules network
operators to heterogeneous devices in low memory environments by determining
checkpoints and recomputations of tensors. Our approach selects suitable
resources per timestep and operator and optimizes the end-to-end time for
neural networks taking the memory limitation of each device into account. For
this, we formulate a mixed-integer quadratic program (MIQP) to schedule
operators of deep learning networks on heterogeneous systems. We compare our
MIQP solver XEngine against Checkmate, a mixed-integer linear programming
(MILP) approach that solves recomputation on a single device. Our solver finds
solutions that are up to 22.5 % faster than the fastest Checkmate schedule in
which the network is computed exclusively on a single device. We also find
valid schedules for networks making use of both central processing units and
graphics processing units if memory limitations do not allow scheduling
exclusively to the graphics processing unit.
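To illustrate the kind of program involved, here is a simplified sketch, not the exact XEngine formulation. Let the binary variable R_{t,i}^d mean that operator i is (re)computed on device d at stage t, and S_{t,i}^d mean that the output tensor of i is resident in the memory of device d during stage t. With compute costs c_i^d, tensor sizes m_i, device memory budgets M^d, transfer costs tau_i^{d->d'}, and dependency edges E, a scheduling MIQP of this flavor reads:
```latex
% Simplified scheduling MIQP (illustrative; not the exact XEngine model).
\begin{align*}
\min_{R,S}\;\;
  & \sum_{t,d,i} c_i^{d} R_{t,i}^{d}
    + \sum_{t}\sum_{d \neq d'}\sum_{(i,j)\in E} \tau_i^{d\to d'}\, S_{t,i}^{d}\, R_{t,j}^{d'}
    && \text{compute time + cross-device transfers (quadratic terms)} \\
\text{s.t.}\;\;
  & R_{t,j}^{d} \le \textstyle\sum_{d'} \bigl(R_{t,i}^{d'} + S_{t,i}^{d'}\bigr)
    && \forall (i,j)\in E:\ \text{inputs must be computed or resident} \\
  & S_{t+1,i}^{d} \le S_{t,i}^{d} + R_{t,i}^{d}
    && \text{tensors stay resident only if resident or just computed} \\
  & \textstyle\sum_i m_i S_{t,i}^{d} \le M^{d}
    && \forall t, d:\ \text{per-device memory limit} \\
  & R_{t,i}^{d},\, S_{t,i}^{d} \in \{0,1\}
\end{align*}
```
The product S_{t,i}^d R_{t,j}^{d'} of two binaries is what makes such a program quadratic rather than linear; the sketch only indicates where quadratic terms and per-device memory limits can arise, not the full model of the paper.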
Related papers
- GPU Memory Usage Optimization for Backward Propagation in Deep Network Training [4.444935537351665]
This paper focuses on efficiently finding the optimal checkpoint subset to achieve the least peak memory usage during model training.
We first describe the theoretical background of neural network training in terms of mathematical equations.
We use these equations to identify all essential data required during both the forward and backward phases to compute the gradients of the model's weights.
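As a generic illustration of why forward tensors become backward-phase dependencies (a textbook example, not taken from this paper), consider a linear layer:
```latex
% Forward tensors needed in the backward pass of a linear layer.
y = W x, \qquad
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^{\top}, \qquad
\frac{\partial L}{\partial x} = W^{\top}\, \frac{\partial L}{\partial y}
```
The forward input x must therefore either stay in memory until the backward pass or be recomputed from an earlier checkpoint.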
arXiv Detail & Related papers (2025-02-18T03:26:39Z)
- Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory [0.8321953606016751]
We introduce memory-efficient gradient checkpointing strategies tailored for the general class of sparse RNNs and Spiking Neural Networks.
We find that Double Checkpointing emerges as the most effective method, optimizing the use of local memory resources while minimizing recomputation overhead.
arXiv Detail & Related papers (2024-12-16T14:23:31Z)
- OLLA: Decreasing the Memory Usage of Neural Networks by Optimizing the Lifetime and Location of Arrays [6.418232942455968]
OLLA is an algorithm that optimizes the lifetime and memory location of the tensors used to train neural networks.
We present several techniques to simplify the encoding of the problem, and enable our approach to scale to the size of state-of-the-art neural networks.
arXiv Detail & Related papers (2022-10-24T02:39:13Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively in accuracy with KIP and other dataset condensation algorithms over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Variable Bitrate Neural Fields [75.24672452527795]
We present a dictionary method for compressing feature grids, reducing their memory consumption by up to 100x.
We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available.
arXiv Detail & Related papers (2022-06-15T17:58:34Z)
- A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks [11.461878019780597]
Gradient descent might converge slowly in some deep neural networks.
It remains mysterious whether a gradient clipping scheme can take advantage of multiple machines to enjoy a parallel speedup.
arXiv Detail & Related papers (2022-05-10T16:55:33Z)
- Fixed-Point Code Synthesis For Neural Networks [0.0]
A new technique is introduced to tune the formats (precision) of already trained neural networks using fixed-point arithmetic.
The optimized neural network computes its output with fixed-point numbers without degrading the accuracy beyond a threshold fixed by the user.
arXiv Detail & Related papers (2022-02-04T12:02:54Z)
- Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
arXiv Detail & Related papers (2022-01-16T07:22:47Z)
- Collaborative Learning over Wireless Networks: An Introductory Overview [84.09366153693361]
We will mainly focus on collaborative training across wireless devices.
Many distributed optimization algorithms have been developed over the last decades.
They provide data locality; that is, a joint model can be trained collaboratively while the data available at each participating device remains local.
arXiv Detail & Related papers (2021-12-07T20:15:39Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
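One way to make such a decomposition concrete (an illustrative bit-plane identity; the paper's exact encoding and normalization may differ): a K-bit quantized weight restricted to odd integer levels can be written as
```latex
% Illustrative {-1, +1} bit-plane decomposition of a quantized weight.
w = \sum_{k=0}^{K-1} 2^{k}\, b_k, \qquad b_k \in \{-1, +1\}
```
which covers exactly the odd integers in [-(2^K - 1), 2^K - 1], so a K-bit quantized layer splits into K binary {-1, +1} branches whose outputs are recombined with the weights 2^k.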
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- Reservoir Stack Machines [77.12475691708838]
Memory-augmented neural networks equip a recurrent neural network with an explicit memory to support tasks that require information storage.
We introduce the reservoir stack machine, a model which can provably recognize all deterministic context-free languages.
Our results show that the reservoir stack machine achieves zero error, even on test sequences longer than the training data.
arXiv Detail & Related papers (2021-05-04T16:50:40Z)
- ItNet: iterative neural networks with small graphs for accurate and efficient anytime prediction [1.52292571922932]
In this study, we introduce a class of network models that have a small memory footprint in terms of their computational graphs.
We show state-of-the-art results for semantic segmentation on the CamVid and Cityscapes datasets.
arXiv Detail & Related papers (2021-01-21T15:56:29Z)
- TASO: Time and Space Optimization for Memory-Constrained DNN Inference [5.023660118588569]
Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices.
We propose an approach for ahead-of-time, domain-specific optimization of CNN models, based on an integer linear program (ILP) that selects the primitive operations used to implement convolutional layers.
arXiv Detail & Related papers (2020-05-21T15:08:06Z)
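As a rough sketch of what such a primitive-selection ILP can look like (hypothetical layer names, latencies, and memory figures; a simplification of the paper's formulation), using the PuLP modeling library:
```python
# Hedged sketch: pick one convolution primitive per layer to minimize
# latency under a memory budget. Numbers and names are made up.
import pulp

layers = ["conv1", "conv2"]
primitives = ["direct", "im2col", "winograd"]
latency = {("conv1", "direct"): 3.0, ("conv1", "im2col"): 2.0, ("conv1", "winograd"): 1.5,
           ("conv2", "direct"): 5.0, ("conv2", "im2col"): 3.5, ("conv2", "winograd"): 2.5}
memory = {("conv1", "direct"): 1, ("conv1", "im2col"): 4, ("conv1", "winograd"): 6,
          ("conv2", "direct"): 1, ("conv2", "im2col"): 6, ("conv2", "winograd"): 9}
budget = 10  # total workspace memory available (arbitrary units)

prob = pulp.LpProblem("primitive_selection", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (layers, primitives), cat="Binary")

# Objective: total inference latency of the chosen primitives.
prob += pulp.lpSum(latency[l, p] * x[l][p] for l in layers for p in primitives)

# Exactly one primitive per layer.
for l in layers:
    prob += pulp.lpSum(x[l][p] for p in primitives) == 1

# The combined workspace memory must fit the budget.
prob += pulp.lpSum(memory[l, p] * x[l][p] for l in layers for p in primitives) <= budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for l in layers:
    print(l, "->", [p for p in primitives if x[l][p].value() > 0.5][0])
```
Per its title, TASO's actual formulation also optimizes over time and space (buffer lifetimes); the sketch only shows the selection aspect.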
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.