Low Precision Decentralized Distributed Training with Heterogeneous Data
- URL: http://arxiv.org/abs/2111.09389v1
- Date: Wed, 17 Nov 2021 20:48:09 GMT
- Title: Low Precision Decentralized Distributed Training with Heterogeneous Data
- Authors: Sai Aparna Aketi, Sangamesh Kodge, Kaushik Roy
- Abstract summary: We show the convergence of low precision decentralized training that aims to reduce the computational complexity of training and inference.
Experiments indicate that 8-bit decentralized training has minimal accuracy loss compared to its full precision counterpart even with heterogeneous data.
- Score: 5.43185002439223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decentralized distributed learning is the key to enabling large-scale machine
learning (training) on the edge devices utilizing private user-generated local
data, without relying on the cloud. However, practical realization of such
on-device training is limited by the communication bottleneck, computation
complexity of training deep models and significant data distribution skew
across devices. Many feedback-based compression techniques have been proposed
in the literature to reduce the communication cost, and a few works propose
algorithmic changes to aid the performance in the presence of skewed data
distribution by improving convergence rate. To the best of our knowledge, there
is no work in the literature that applies and demonstrates compute-efficient training
techniques such as quantization, pruning, etc., for peer-to-peer decentralized
learning setups. In this paper, we analyze and show the convergence of low
precision decentralized training that aims to reduce the computational
complexity of training and inference. Further, we study the effect of the degree of
skew and communication compression on the low precision decentralized training
over various computer vision and Natural Language Processing (NLP) tasks. Our
experiments indicate that 8-bit decentralized training has minimal accuracy
loss compared to its full precision counterpart even with heterogeneous data.
However, when low precision training is accompanied by communication
compression through sparsification we observe 1-2% drop in accuracy. The
proposed low precision decentralized training decreases computational
complexity, memory usage, and communication cost by ~4x while trading off less
than 1% accuracy for both IID and non-IID data. In particular, with higher
skew values, we observe an increase in accuracy (by ~0.5%) with low precision
training, indicating the regularization effect of the quantization.
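As a rough illustration of the training setup described above, the sketch below combines per-tensor 8-bit weight quantization, top-k sparsification of the exchanged tensors, and one gossip-averaging step with a set of neighbor models. This is a minimal, hypothetical sketch: the quantizer, mixing weight, sparsification ratio, and function names are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization to 8 bits; returns int8 values plus a scale."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def top_k_sparsify(x, ratio=0.1):
    """Communication compression: keep only the largest-magnitude entries."""
    k = max(1, int(ratio * x.size))
    flat = x.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(x.shape)

def gossip_average(local_w, neighbor_ws, mixing_weight=0.5):
    """One gossip step: mix the local model with the mean of the received neighbor models."""
    return (1 - mixing_weight) * local_w + mixing_weight * np.mean(neighbor_ws, axis=0)

# One communication round for a single agent (weights shown as a flat vector).
w_local = np.random.randn(1000).astype(np.float32)
received = []
for _ in range(3):  # three hypothetical neighbors
    w_nbr = np.random.randn(1000).astype(np.float32)
    q, s = quantize_int8(top_k_sparsify(w_nbr))  # neighbors send sparsified, 8-bit messages
    received.append(dequantize(q, s))
w_local = gossip_average(w_local, received)
```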
Related papers
- KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training [2.8804804517897935]
We propose a method for hiding the least-important samples during the training of deep neural networks.
We adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process.
Our method can reduce total training time by up to 22%, impacting accuracy by only 0.4% compared to the baseline.
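A minimal sketch of the general idea, assuming per-sample loss from the previous epoch as the importance proxy (the paper's actual criterion and schedule are not given in this summary):

```python
import numpy as np

def select_visible_samples(per_sample_loss, hide_fraction=0.2):
    """Hide the lowest-loss (least-important) samples for the upcoming epoch.

    per_sample_loss: one loss value per training sample from the previous epoch.
    Returns indices of samples to keep; hide_fraction can be adapted per epoch.
    """
    n_hide = int(hide_fraction * len(per_sample_loss))
    order = np.argsort(per_sample_loss)  # ascending: easiest samples first
    return np.sort(order[n_hide:])       # train only on the remaining samples

# Example: losses for 10 samples; hide the 2 easiest ones this epoch.
losses = np.array([0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 1.2, 0.2, 0.8, 0.4])
keep = select_visible_samples(losses, hide_fraction=0.2)
print(keep)  # indices of the 8 samples used in the next epoch
```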
arXiv Detail & Related papers (2023-10-16T06:19:29Z)
- Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z)
- Efficient Augmentation for Imbalanced Deep Learning [8.38844520504124]
We study a convolutional neural network's internal representation of imbalanced image data.
We measure the generalization gap between a model's feature embeddings in the training and test sets, showing that the gap is wider for minority classes.
This insight enables us to design an efficient three-phase CNN training framework for imbalanced data.
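One plausible way to quantify such a per-class gap in feature space (a hypothetical sketch, not the paper's exact metric) is to compare class-wise embedding centroids between the training and test sets:

```python
import numpy as np

def per_class_embedding_gap(train_emb, train_y, test_emb, test_y):
    """Distance between train and test class centroids in the model's feature space.

    A larger distance suggests a wider generalization gap; on imbalanced data this
    is expected to be larger for minority classes.
    """
    gaps = {}
    for c in np.unique(train_y):
        mu_train = train_emb[train_y == c].mean(axis=0)
        mu_test = test_emb[test_y == c].mean(axis=0)
        gaps[int(c)] = float(np.linalg.norm(mu_train - mu_test))
    return gaps

# Toy example with random 16-dimensional embeddings for two classes.
rng = np.random.default_rng(0)
tr_e, tr_y = rng.normal(size=(100, 16)), rng.integers(0, 2, 100)
te_e, te_y = rng.normal(size=(40, 16)), rng.integers(0, 2, 40)
print(per_class_embedding_gap(tr_e, tr_y, te_e, te_y))
```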
arXiv Detail & Related papers (2022-07-13T09:43:17Z)
- Efficient training of lightweight neural networks using Online Self-Acquired Knowledge Distillation [51.66271681532262]
Online Self-Acquired Knowledge Distillation (OSAKD) is proposed, aiming to improve the performance of any deep neural model in an online manner.
We utilize a k-NN non-parametric density estimation technique to estimate the unknown probability distributions of the data samples in the output feature space.
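A minimal sketch of k-NN density estimation over output-space features (k and the feature dimensionality are illustrative; the distillation step itself is omitted):

```python
from math import gamma, pi

import numpy as np

def knn_density(features, k=5):
    """Non-parametric k-NN density estimate for each row of a feature matrix.

    density(x) ~ k / (n * V_d * r_k(x)^d), where r_k is the distance to the k-th
    nearest neighbour and V_d is the volume of the d-dimensional unit ball.
    """
    n, d = features.shape
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)         # exclude each point itself
    r_k = np.sort(dists, axis=1)[:, k - 1]  # distance to the k-th neighbour
    v_d = pi ** (d / 2) / gamma(d / 2 + 1)  # unit-ball volume in d dimensions
    return k / (n * v_d * np.maximum(r_k, 1e-12) ** d)

# Example on random 2-D output features for 50 samples.
feats = np.random.randn(50, 2)
print(knn_density(feats, k=5)[:5])
```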
arXiv Detail & Related papers (2021-08-26T14:01:04Z)
- AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks [78.62086125399831]
We present a general approach called Alternating Compressed/DeCompressed (AC/DC) training of deep neural networks (DNNs).
AC/DC outperforms existing sparse training methods in accuracy at similar computational budgets.
An important property of AC/DC is that it allows co-training of dense and sparse models, yielding accurate sparse-dense model pairs at the end of the training process.
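A rough sketch of the alternating idea (illustrative assumptions: the phase length, sparsity level, and pruning rule are not taken from the paper, and the mask is recomputed every step here for brevity rather than fixed within a compressed phase):

```python
import numpy as np

def magnitude_mask(w, sparsity=0.9):
    """Binary mask that zeroes out the smallest-magnitude fraction of weights."""
    k = int(sparsity * w.size)
    thresh = np.sort(np.abs(w.ravel()))[k - 1] if k > 0 else -np.inf
    return (np.abs(w) > thresh).astype(w.dtype)

def acdc_step(w, grad, lr, step, phase_len=100, sparsity=0.9):
    """One SGD update that alternates between decompressed (dense) and compressed (sparse) phases."""
    compressed = (step // phase_len) % 2 == 1  # odd phases train sparse
    w = w - lr * grad
    if compressed:
        w = w * magnitude_mask(w, sparsity)    # re-apply the sparsity pattern
    return w

# Toy loop on a random "layer" with random gradients.
w = np.random.randn(1000).astype(np.float32)
for step in range(400):
    g = np.random.randn(1000).astype(np.float32)
    w = acdc_step(w, g, lr=0.01, step=step)
print(f"final nonzero fraction: {np.count_nonzero(w) / w.size:.2f}")
```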
arXiv Detail & Related papers (2021-06-23T13:23:00Z)
- Data optimization for large batch distributed training of deep neural networks [0.19336815376402716]
Current practice for distributed training of deep neural networks faces the challenges of communication bottlenecks when operating at scale.
We propose a data optimization approach that utilizes machine learning to implicitly smooth out the loss landscape, resulting in fewer local minima.
Our approach filters out data points that are less important to feature learning, enabling us to speed up training with larger batch sizes and improved accuracy.
arXiv Detail & Related papers (2020-12-16T21:22:02Z)
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
- Step-Ahead Error Feedback for Distributed Training with Compressed Gradient [99.42912552638168]
We show that a new "gradient mismatch" problem arises from local error feedback in centralized distributed training.
We propose two novel techniques, 1) step ahead and 2) error averaging, with rigorous theoretical analysis.
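For context, a generic error-feedback loop with compressed gradients looks roughly like the sketch below; the step-ahead and error-averaging corrections proposed in the paper are not reproduced here, and the top-k compressor and ratio are illustrative assumptions:

```python
import numpy as np

def top_k(x, ratio=0.01):
    """Gradient compression: keep only the largest-magnitude entries."""
    k = max(1, int(ratio * x.size))
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

def error_feedback_step(grad, error, ratio=0.01):
    """Classic error feedback: compress (gradient + carried error), keep the residual."""
    corrected = grad + error              # add back what was dropped last round
    compressed = top_k(corrected, ratio)  # this is what gets communicated
    new_error = corrected - compressed    # residual carried to the next step
    return compressed, new_error

# Toy usage for one worker over a few iterations.
err = np.zeros(10_000, dtype=np.float32)
for _ in range(5):
    g = np.random.randn(10_000).astype(np.float32)
    msg, err = error_feedback_step(g, err)
```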
arXiv Detail & Related papers (2020-08-13T11:21:07Z)
- Enabling On-Device CNN Training by Self-Supervised Instance Filtering and Error Map Pruning [17.272561332310303]
This work aims to enable on-device training of convolutional neural networks (CNNs) by reducing the computation cost at training time.
CNN models are usually trained on high-performance computers and only the trained models are deployed to edge devices.
arXiv Detail & Related papers (2020-07-07T05:52:37Z)
- Towards Efficient Scheduling of Federated Mobile Devices under Computational and Statistical Heterogeneity [16.069182241512266]
This paper studies the implementation of distributed learning on mobile devices.
We use data as a tuning knob and propose two efficient-time algorithms to schedule different workloads.
Compared with common benchmarks, the proposed algorithms achieve a 2-100x speedup, a 2-7% accuracy gain, and improve the convergence rate by more than 100% on CIFAR10.
arXiv Detail & Related papers (2020-05-25T18:21:51Z)
- Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)