Tunable Subnetwork Splitting for Model-parallelism of Neural Network Training
- URL: http://arxiv.org/abs/2009.04053v2
- Date: Wed, 16 Sep 2020 21:18:59 GMT
- Title: Tunable Subnetwork Splitting for Model-parallelism of Neural Network Training
- Authors: Junxiang Wang, Zheng Chai, Yue Cheng, Liang Zhao
- Abstract summary: We propose a Tunable Subnetwork Splitting Method (TSSM) to tune the decomposition of deep neural networks.
Our proposed TSSM can achieve significant speedup without observable loss of training accuracy.
- Score: 12.755664985045582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Alternating minimization methods have recently been proposed as alternatives
to gradient descent for deep neural network optimization. Alternating
minimization methods can typically decompose a deep neural network into
layerwise subproblems, which can then be optimized in parallel. Despite the
significant parallelism, alternating minimization methods are rarely explored
in training deep neural networks because of the severe accuracy degradation. In
this paper, we analyze the reason and propose to achieve a compelling trade-off
between parallelism and accuracy by a reformulation called Tunable Subnetwork
Splitting Method (TSSM), which can tune the decomposition granularity of deep
neural networks. Two methods, gradient splitting Alternating Direction Method of
Multipliers (gsADMM) and gradient splitting Alternating Minimization (gsAM), are
proposed to solve the TSSM formulation. Experiments on five benchmark datasets
show that our proposed TSSM can achieve significant speedup without observable
loss of training accuracy. The code has been released at
https://github.com/xianggebenben/TSSM.
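The formulation in the abstract lends itself to a compact illustration. The sketch below shows the general pattern for a two-way split: the network is cut into subnetworks, an auxiliary variable decouples them, and each quadratic-penalty subproblem is updated by a few gradient steps in an alternating, gsAM-style fashion, so the subnetwork updates can run independently. This is a minimal reconstruction for intuition only, not the released TSSM code; the split point, the penalty weight `rho`, and the step counts are illustrative assumptions.

```python
# Illustrative gsAM-style alternating scheme for a two-way subnetwork split.
# NOT the authors' implementation; `rho`, `inner_steps`, and the split point
# are assumptions made for this sketch.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: 10-class classification on random inputs.
X, y = torch.randn(256, 32), torch.randint(0, 10, (256,))

# One MLP cut at a tunable point into two subnetworks f1 and f2.
f1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
f2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()

rho = 1.0          # penalty weight coupling the subnetworks
inner_steps = 5    # gradient steps per subproblem
opt1 = torch.optim.SGD(f1.parameters(), lr=0.1)
opt2 = torch.optim.SGD(f2.parameters(), lr=0.1)

# Auxiliary variable a1 stands in for the output of subnetwork 1,
# decoupling the two subproblems.
a1 = f1(X).detach().requires_grad_(True)
opt_a = torch.optim.SGD([a1], lr=0.1)

for epoch in range(20):
    # Subproblem 1: pull f1's output toward the auxiliary variable
    # (no dependence on f2, so it can run on its own worker).
    for _ in range(inner_steps):
        opt1.zero_grad()
        (rho / 2 * (f1(X) - a1.detach()).pow(2).mean()).backward()
        opt1.step()

    # Subproblem 2: fit f2 on the auxiliary variable (no dependence on f1).
    for _ in range(inner_steps):
        opt2.zero_grad()
        loss_fn(f2(a1.detach()), y).backward()
        opt2.step()

    # Auxiliary update ties the subproblems back together.
    for _ in range(inner_steps):
        opt_a.zero_grad()
        (loss_fn(f2(a1), y)
         + rho / 2 * (a1 - f1(X).detach()).pow(2).mean()).backward()
        opt_a.step()

with torch.no_grad():
    acc = (f2(f1(X)).argmax(dim=1) == y).float().mean()
print(f"training accuracy after alternating updates: {acc:.2f}")
```

The decomposition granularity is the quantity the paper tunes: more split points yield more independent subproblems and hence more parallelism, but a looser coupling that must be controlled to avoid accuracy loss, while fewer split points recover ordinary end-to-end training.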
Related papers
- Improving Generalization of Deep Neural Networks by Optimum Shifting [33.092571599896814]
We propose a novel method called optimum shifting, which changes the parameters of a neural network from a sharp minimum to a flatter one.
Our method is based on the observation that when the input and output of a neural network are fixed, the matrix multiplications within the network can be treated as systems of under-determined linear equations.
arXiv Detail & Related papers (2024-05-23T02:31:55Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights (a toy sketch of this estimator appears at the end of this list).
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- DeepSplit: Scalable Verification of Deep Neural Networks via Operator Splitting [70.62923754433461]
Analyzing the worst-case performance of deep neural networks against input perturbations amounts to solving a large-scale non-convex optimization problem.
We propose a novel method that can directly solve a convex relaxation of the problem to high accuracy, by splitting it into smaller subproblems that often have analytical solutions.
arXiv Detail & Related papers (2021-06-16T20:43:49Z)
- Partitioning sparse deep neural networks for scalable training and inference [8.282177703075453]
State-of-the-art deep neural networks (DNNs) have significant computational and data management requirements.
Sparsification and pruning methods are shown to be effective in removing a large fraction of connections in DNNs.
The resulting sparse networks present unique challenges to further improve the computational efficiency of training and inference in deep learning.
arXiv Detail & Related papers (2021-04-23T20:05:52Z)
- Local Critic Training for Model-Parallel Learning of Deep Neural Networks [94.69202357137452]
We propose a novel model-parallel learning method, called local critic training.
We show that the proposed approach successfully decouples the update process of the layer groups for both convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
We also show that trained networks by the proposed method can be used for structural optimization.
arXiv Detail & Related papers (2021-02-03T09:30:45Z)
- Selfish Sparse RNN Training [13.165729746380816]
We propose an approach to train sparse RNNs with a fixed parameter count in one single run, without compromising performance.
We achieve state-of-the-art sparse training results on the Penn TreeBank and Wikitext-2 datasets.
arXiv Detail & Related papers (2021-01-22T10:45:40Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale problems with a deep neural network as the predictive model.
Our algorithm requires far fewer communication rounds than a naive parallel approach while retaining its theoretical guarantees.
Experiments on several benchmark datasets demonstrate the effectiveness of our algorithm and confirm the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Unsupervised Adaptive Neural Network Regularization for Accelerated Radial Cine MRI [3.6280929178575994]
We propose an iterative reconstruction scheme for 2D radial cine MRI based on ground truth-free unsupervised learning of shallow convolutional neural networks.
The network is trained to approximate patches of the current estimate of the solution during the reconstruction.
arXiv Detail & Related papers (2020-02-10T14:47:20Z)
- MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
However, the use of gradient-based training combined with the nonconvexity of the underlying optimization problem renders learning susceptible to the choice of initialization.
We propose fusing neighboring layers of deeper networks that are trained with random initializations.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
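For the "Scaling Forward Gradient With Local Losses" entry above, the activity-perturbation idea can be made concrete with a toy forward-gradient estimator: perturb a layer's pre-activation with a random direction, read off the directional derivative with a forward-mode product, and map it back to a weight gradient through the local chain rule dL/dW = (dL/dz) x^T. This is a hedged sketch of the general estimator, not that paper's code; the layer sizes, the fixed readout, and the use of torch.autograd.functional.jvp are choices made for this illustration.

```python
# Toy activity-perturbed forward-gradient estimate for one linear layer,
# compared against backprop. A sketch of the general idea only; the sizes
# and the averaging over 1000 random directions are illustrative choices.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(16)                          # layer input (single sample)
W = torch.randn(8, 16, requires_grad=True)   # layer weights to estimate gradients for
readout = torch.randn(10, 8)                 # fixed downstream weights
y = torch.tensor([3])                        # class label

def loss_from_preact(z):
    # Everything downstream of the pre-activation z; does not touch W.
    logits = F.relu(z) @ readout.t()
    return F.cross_entropy(logits.unsqueeze(0), y)

z = (W @ x).detach()                         # pre-activation of the layer

# Average the estimator dL/dz ~= (dL/dz . u) u over random directions u,
# then push it through the local chain rule dL/dW = (dL/dz) x^T.
n_dirs = 1000
g_W = torch.zeros_like(W)
for _ in range(n_dirs):
    u = torch.randn_like(z)                  # perturb the ACTIVATION, not the weights
    _, d = torch.autograd.functional.jvp(loss_from_preact, (z,), (u,))
    g_W += torch.outer(d * u, x) / n_dirs

# Reference gradient via ordinary backprop, for comparison.
loss_from_preact(W @ x).backward()

cos = F.cosine_similarity(g_W.flatten(), W.grad.flatten(), dim=0)
print(f"cosine similarity with the backprop gradient: {cos:.3f}")
```

Because the random perturbation lives in the activation space rather than the much larger weight space, far fewer directions are needed for a usable estimate, which is the variance reduction that entry refers to.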