Understanding the Effects of Data Parallelism and Sparsity on Neural
Network Training
- URL: http://arxiv.org/abs/2003.11316v3
- Date: Fri, 2 Apr 2021 08:35:13 GMT
- Title: Understanding the Effects of Data Parallelism and Sparsity on Neural
Network Training
- Authors: Namhoon Lee, Thalaiyasingam Ajanthan, Philip H. S. Torr, Martin Jaggi
- Abstract summary: We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
- Score: 126.49572353148262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study two factors in neural network training: data parallelism and
sparsity; here, data parallelism means processing training data in parallel
using distributed systems (or equivalently increasing batch size), so that
training can be accelerated; by sparsity, we refer to pruning parameters in a
neural network model, so as to reduce computational and memory cost. Despite
their promising benefits, however, our understanding of their effects on neural
network training remains elusive. In this work, we first measure these effects
rigorously by conducting extensive experiments while tuning all metaparameters
involved in the optimization. As a result, across various workloads of data
set, network model, and optimization algorithm, we find a general scaling trend
between batch size and the number of training steps to convergence, which
characterizes the effect of data parallelism, as well as an increased
difficulty of training under sparsity. Then, we develop a theoretical analysis
based on the convergence
properties of stochastic gradient methods and smoothness of the optimization
landscape, which illustrates the observed phenomena precisely and generally,
establishing a better account of the effects of data parallelism and sparsity
on neural network training.
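Illustration (not from the paper): a standard smoothness-based bound for mini-batch SGD already hints at why such a batch-size scaling trend should exist. For an L-smooth objective f with per-example gradient variance \sigma^2, running T steps of mini-batch SGD with batch size B and step size \eta \le 1/L gives

    \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\|\nabla f(w_t)\|^2 \;\le\; \frac{2\,(f(w_0)-f^*)}{\eta T} + \frac{\eta L \sigma^2}{B},

so a larger batch size B suppresses the gradient-noise term and lets a target gradient norm be reached in fewer steps T, with diminishing returns once the first term dominates. This is a textbook bound, not necessarily the exact analysis developed in the paper.

A minimal sketch of the kind of measurement described in the abstract, assuming a toy task, model, and hyperparameters that are not taken from the paper: it counts how many SGD steps a small network needs to reach a target training loss at several batch sizes, with an optional one-shot magnitude-pruning pass to induce sparsity.

    # Minimal sketch (not the authors' experimental protocol): count the number
    # of SGD steps a small network needs to reach a target training loss at
    # several batch sizes, with optional one-shot magnitude pruning to induce
    # sparsity. The toy task, model size, learning rate, and thresholds are
    # all illustrative assumptions.
    import torch
    import torch.nn as nn

    def make_model():
        return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

    def magnitude_prune(model, sparsity):
        # Zero out the smallest-magnitude parameters globally (one-shot pruning).
        flat = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
        threshold = torch.quantile(flat, sparsity)
        masks = [(p.detach().abs() >= threshold).float() for p in model.parameters()]
        for p, m in zip(model.parameters(), masks):
            p.data.mul_(m)
        return masks

    def steps_to_target(batch_size, sparsity=0.0, target_loss=0.3,
                        max_steps=5000, lr=0.1):
        torch.manual_seed(0)
        X = torch.randn(4096, 20)
        y = (X[:, 0] > 0).long()            # toy, linearly separable labels
        model = make_model()
        masks = magnitude_prune(model, sparsity) if sparsity > 0 else None
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for step in range(1, max_steps + 1):
            idx = torch.randint(0, X.shape[0], (batch_size,))
            loss = loss_fn(model(X[idx]), y[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
            if masks is not None:           # keep pruned weights at zero
                for p, m in zip(model.parameters(), masks):
                    p.data.mul_(m)
            if step % 20 == 0:              # check full training loss periodically
                with torch.no_grad():
                    if loss_fn(model(X), y).item() < target_loss:
                        return step
        return max_steps                    # did not reach the target

    if __name__ == "__main__":
        for bs in (16, 64, 256, 1024):
            dense = steps_to_target(bs)
            sparse = steps_to_target(bs, sparsity=0.8)
            print(f"batch={bs:5d}  steps(dense)={dense:5d}  steps(80% sparse)={sparse:5d}")

Plotting steps-to-target against batch size for the dense and pruned runs gives a rough analogue of the scaling curves studied in the paper: steps typically fall as batch size grows until the curve flattens, and the pruned model tends to need more steps at a given batch size.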
Related papers
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training sizable models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for distributed training of large neural networks.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation [34.7523496790944]
We develop an approach to efficiently grow neural networks, within which parameterization and optimization strategies are designed by considering the training dynamics.
We show that our method achieves comparable or better accuracy than training large fixed-size models, while saving a substantial portion of the original budget for training.
arXiv Detail & Related papers (2023-06-22T07:06:45Z)
- No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths [12.068608358926317]
First-order optimization algorithms are known to efficiently locate favorable minima in deep neural networks.
We focus on the fundamental geometric properties of quantities sampled along two key optimization paths.
Our findings suggest that not only do optimization trajectories never encounter significant obstacles, but they also maintain stable dynamics during the majority of training.
arXiv Detail & Related papers (2023-06-20T22:10:40Z)
- Convergence and scaling of Boolean-weight optimization for hardware reservoirs [0.0]
We analytically derive the scaling laws for highly efficient Coordinate Descent applied to optimize the readout layer of a random recurrently connected neural network.
Our results perfectly reproduce the convergence and scaling of a large-scale photonic reservoir implemented in a proof-of-concept experiment.
arXiv Detail & Related papers (2023-05-13T12:15:25Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- How to Train Your Neural Network: A Comparative Evaluation [1.3654846342364304]
We discuss and compare current state-of-the-art frameworks for large scale distributed deep learning.
We present empirical results comparing their performance on large image and language training tasks.
Based on our results, we discuss algorithmic and implementation portions of each framework which hinder performance.
arXiv Detail & Related papers (2021-11-09T04:24:42Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)