Dynamic Tensor Rematerialization
- URL: http://arxiv.org/abs/2006.09616v4
- Date: Thu, 18 Mar 2021 06:20:23 GMT
- Title: Dynamic Tensor Rematerialization
- Authors: Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan,
Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock
- Abstract summary: Checkpointing enables the training of deep learning models under restricted memory budgets.
Current checkpointing techniques statically plan these recomputations offline and assume static graphs.
We demonstrate that a simple online algorithm can achieve comparable performance by introducing Dynamic Tensor Rematerialization (DTR).
- Score: 11.204761128308542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Checkpointing enables the training of deep learning models under restricted
memory budgets by freeing intermediate activations from memory and recomputing
them on demand. Current checkpointing techniques statically plan these
recomputations offline and assume static computation graphs. We demonstrate
that a simple online algorithm can achieve comparable performance by
introducing Dynamic Tensor Rematerialization (DTR), a greedy online algorithm
for checkpointing that is extensible and general, is parameterized by eviction
policy, and supports dynamic models. We prove that DTR can train an $N$-layer
linear feedforward network on an $\Omega(\sqrt{N})$ memory budget with only
$\mathcal{O}(N)$ tensor operations. DTR closely matches the performance of
optimal static checkpointing in simulated experiments. We incorporate a DTR
prototype into PyTorch merely by interposing on tensor allocations and operator
calls and collecting lightweight metadata on tensors.
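As a concrete illustration of the greedy, heuristic-driven eviction loop the abstract describes, the following is a minimal Python sketch. It is not the paper's implementation: the heuristic shown (recompute cost divided by memory size times staleness) only mirrors the lightweight per-tensor metadata the abstract mentions, and all class, function, and field names are illustrative.

```python
import time

class TensorMeta:
    """Lightweight metadata a DTR-style runtime might track per tensor (illustrative fields)."""
    def __init__(self, size_bytes, compute_cost, parents, op):
        self.size_bytes = size_bytes      # memory footprint of the tensor
        self.compute_cost = compute_cost  # time to recompute it from its parents
        self.parents = parents            # input tensors needed for rematerialization
        self.op = op                      # closure that reruns the recorded operator
        self.last_access = time.monotonic()
        self.resident = True

def staleness(meta, now):
    # Time since the tensor was last accessed (floored to avoid division by zero).
    return max(now - meta.last_access, 1e-9)

def heuristic(meta, now):
    # Greedy score: cheap-to-recompute, large, stale tensors are evicted first.
    return meta.compute_cost / (meta.size_bytes * staleness(meta, now))

def evict_until(budget, used_bytes, evictable):
    """Evict resident, non-pinned tensors until memory usage fits the budget."""
    now = time.monotonic()
    while used_bytes > budget and evictable:
        victim = min(evictable, key=lambda m: heuristic(m, now))
        evictable.remove(victim)
        victim.resident = False           # drop the payload; keep metadata for later recompute
        used_bytes -= victim.size_bytes
    return used_bytes

def rematerialize(meta):
    """Recompute an evicted tensor on demand, recursively restoring evicted parents first."""
    for parent in meta.parents:
        if not parent.resident:
            rematerialize(parent)
    meta.op()                             # rerun the recorded operator call
    meta.resident = True
    meta.last_access = time.monotonic()
```

In such a sketch, a runtime integration along the lines described in the abstract would interpose on tensor allocations (calling something like evict_until when the budget is exceeded) and on operator calls (updating last_access and pinning operands currently in use); the eviction policy remains a swappable parameter, which is what makes the approach applicable to dynamic models.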
Related papers
- Efficient k-Nearest-Neighbor Machine Translation with Dynamic Retrieval [49.825549809652436]
$k$NN-MT constructs an external datastore to store domain-specific translation knowledge.
Adaptive retrieval ($k$NN-MT-AR) dynamically estimates $\lambda$ and skips $k$NN retrieval if $\lambda$ is less than a fixed threshold.
We propose dynamic retrieval ($k$NN-MT-DR) that significantly extends vanilla $k$NN-MT in two aspects.
arXiv Detail & Related papers (2024-06-10T07:36:55Z) - Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning [14.792099973449794]
We propose an algorithm to align the training dynamics of the sparse network with that of the dense one.
We show how the usually neglected data-dependent component in the NTK's spectrum can be taken into account.
Path eXclusion (PX) is able to find lottery tickets even at high sparsity levels.
arXiv Detail & Related papers (2024-06-03T22:19:42Z) - Mixture-of-Depths: Dynamically allocating compute in transformer-based language models [8.774705201394916]
Transformer-based language models spread FLOPs uniformly across input sequences.
We show that transformers can learn to dynamically allocate FLOPs to specific positions in a sequence.
arXiv Detail & Related papers (2024-04-02T19:28:11Z) - Tensor Decomposition Based Attention Module for Spiking Neural Networks [18.924242014716647]
We design the projected full attention (PFA) module, which demonstrates excellent results with linearly growing parameters.
Our method achieves state-of-the-art performance on both static and dynamic benchmark datasets.
arXiv Detail & Related papers (2023-10-23T05:25:49Z) - Truncated tensor Schatten p-norm based approach for spatiotemporal
traffic data imputation with complicated missing patterns [77.34726150561087]
We introduce four complicated missing patterns, including random missing and three fiber-like missing cases according to the mode-driven fibers.
Despite the nonconvexity of the objective function in our model, we derive the optimal solutions by integrating the alternating direction method of multipliers (ADMM).
arXiv Detail & Related papers (2022-05-19T08:37:56Z) - Pretraining Graph Neural Networks for few-shot Analog Circuit Modeling
and Design [68.1682448368636]
We present a supervised pretraining approach to learn circuit representations that can be adapted to new unseen topologies or unseen prediction tasks.
To cope with the variable topological structure of different circuits we describe each circuit as a graph and use graph neural networks (GNNs) to learn node embeddings.
We show that pretraining GNNs on prediction of output node voltages can encourage learning representations that can be adapted to new unseen topologies or prediction of new circuit level properties.
arXiv Detail & Related papers (2022-03-29T21:18:47Z) - Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via
GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers and achieved performance comparable to pure data-driven networks while using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z) - Online Limited Memory Neural-Linear Bandits with Likelihood Matching [53.18698496031658]
We study neural-linear bandits for solving problems where both exploration and representation learning play an important role.
We propose a likelihood matching algorithm that is resilient to catastrophic forgetting and is completely online.
arXiv Detail & Related papers (2021-02-07T14:19:07Z) - On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z) - A Multi-Scale Tensor Network Architecture for Classification and
Regression [0.0]
We present an algorithm for supervised learning using tensor networks.
We employ a step of preprocessing the data by coarse-graining through a sequence of wavelet transformations.
We show how fine-graining through the network may be used to initialize models with access to finer-scale features.
arXiv Detail & Related papers (2020-01-22T21:26:28Z) - Sparse and Low-Rank High-Order Tensor Regression via Parallel Proximal
Method [6.381138694845438]
We propose the Sparse and Low-rank Regression model for large-scale data with high-order structures.
Our model enforces sparsity and low-rankness of the tensor coefficient.
Our model's predictions exhibit meaningful interpretations on the video dataset.
arXiv Detail & Related papers (2019-11-29T06:25:36Z)