Pufferfish: Communication-efficient Models At No Extra Cost
- URL: http://arxiv.org/abs/2103.03936v1
- Date: Fri, 5 Mar 2021 20:46:39 GMT
- Title: Pufferfish: Communication-efficient Models At No Extra Cost
- Authors: Hongyi Wang, Saurabh Agarwal, Dimitris Papailiopoulos
- Abstract summary: Pufferfish is a communication- and computation-efficient distributed training framework.
It incorporates gradient compression into the model training process by training low-rank, pre-factorized deep networks.
It achieves the same accuracy as state-of-the-art, off-the-shelf deep models.
- Score: 7.408148824204065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To mitigate communication overheads in distributed model training, several
studies propose the use of compressed stochastic gradients, usually achieved by
sparsification or quantization. Such techniques achieve high compression
ratios, but in many cases incur either significant computational overheads or
some accuracy loss. In this work, we present Pufferfish, a communication- and
computation-efficient distributed training framework that incorporates
gradient compression into the model training process by training low-rank,
pre-factorized deep networks. Pufferfish not only reduces communication, but
also completely bypasses any computation overheads related to compression, and
achieves the same accuracy as state-of-the-art, off-the-shelf deep models.
Pufferfish can be directly integrated into current deep learning frameworks
with minimal implementation modifications. Our extensive experiments over real
distributed setups, across a variety of large-scale machine learning tasks,
indicate that Pufferfish achieves up to 1.64x end-to-end speedup over the
latest distributed training API in PyTorch without accuracy loss. Compared to
the Lottery Ticket Hypothesis models, Pufferfish leads to equally accurate,
small-parameter models while avoiding the burden of "winning the lottery".
Pufferfish also leads to more accurate and smaller models than SOTA structured
model pruning methods.
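The core idea, training layers directly in a low-rank, pre-factorized form so that only small factors are computed and communicated, can be illustrated with a minimal PyTorch sketch. The layer below is an illustrative stand-in, not the paper's implementation; the class name, initialization, and rank choice are assumptions.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """A linear layer pre-factorized as W ~= U @ V and trained directly in factored form."""

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        # Only the two thin factors are parameters, so the gradients exchanged during
        # data-parallel training have O(rank * (in + out)) entries instead of O(in * out).
        self.U = nn.Parameter(torch.randn(out_features, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x (U V)^T + b, computed as two thin matrix products
        return (x @ self.V.t()) @ self.U.t() + self.bias

# Example: a rank-32 replacement for a 1024 -> 1024 dense layer.
layer = LowRankLinear(1024, 1024, rank=32)
y = layer(torch.randn(8, 1024))
print(y.shape)                                       # torch.Size([8, 1024])
print(sum(p.numel() for p in layer.parameters()))    # ~66K parameters vs. ~1.05M full-rank
```

Because the factors are ordinary trainable parameters, no separate compress/decompress step is needed during training, matching the abstract's claim of bypassing compression-related computation overheads.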
Related papers
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
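As a rough illustration of the TopK compression referenced above, the sketch below keeps only the k largest-magnitude entries of a tensor; the function names and the absence of error feedback are simplifications, not the paper's exact scheme.

```python
import math
import torch

def topk_compress(grad: torch.Tensor, k: int):
    """Keep only the k largest-magnitude entries; transmit (values, indices, shape)."""
    flat = grad.flatten()
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape

def topk_decompress(values: torch.Tensor, indices: torch.Tensor, shape) -> torch.Tensor:
    """Scatter the transmitted entries back into a dense zero tensor."""
    flat = torch.zeros(math.prod(shape), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

g = torch.randn(256, 128)
vals, idx, shape = topk_compress(g, k=g.numel() // 100)   # keep ~1% of the entries
g_hat = topk_decompress(vals, idx, shape)
```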
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- Retraining-free Model Quantization via One-Shot Weight-Coupling Learning [41.299675080384]
Mixed-precision quantization (MPQ) is advocated to compress models effectively by allocating heterogeneous bit-widths across layers.
MPQ is typically organized into a searching-retraining two-stage process.
In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression.
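To make per-layer bit-width allocation concrete, here is a minimal sketch of symmetric uniform quantization with heterogeneous precision; the layer names and bit assignments are made up for illustration, and the paper's searching procedure is not shown.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform quantization of a weight tensor to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

# Hypothetical per-layer bit-width plan (heterogeneous precision across layers).
weights = {"conv1": torch.randn(64, 3, 3, 3), "fc": torch.randn(10, 512)}
bit_plan = {"conv1": 8, "fc": 4}
quantized = {name: quantize_symmetric(w, bit_plan[name]) for name, w in weights.items()}
```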
arXiv Detail & Related papers (2024-01-03T05:26:57Z)
- Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH).
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
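The LSH idea can be sketched generically: sign random projections send similar feature-map channels to the same hash bucket, so redundant channels can be detected cheaply. The code below is a generic illustration under that assumption, not the HASTE module itself.

```python
import torch

def lsh_buckets(feature_map: torch.Tensor, num_hyperplanes: int = 8, seed: int = 0) -> torch.Tensor:
    """Assign each channel of a (C, H, W) feature map to an LSH bucket using
    sign random projections; channels sharing a bucket are near-duplicates."""
    num_channels = feature_map.shape[0]
    flat = feature_map.reshape(num_channels, -1)                     # one vector per channel
    flat = flat / flat.norm(dim=1, keepdim=True).clamp(min=1e-8)     # compare directions only
    gen = torch.Generator().manual_seed(seed)
    hyperplanes = torch.randn(flat.shape[1], num_hyperplanes, generator=gen)
    bits = (flat @ hyperplanes) > 0                                  # (C, num_hyperplanes) sign bits
    powers = 2 ** torch.arange(num_hyperplanes)
    return (bits.long() * powers).sum(dim=1)                         # integer bucket id per channel

fmap = torch.randn(64, 16, 16)
buckets = lsh_buckets(fmap)
print(buckets[:8])   # channels with equal bucket ids are candidates for merging
```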
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for reducing the communication and per-worker computation costs of distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Cuttlefish: Low-Rank Model Training without All the Tuning [55.984294012024755]
We introduce Cuttlefish, an automated low-rank training approach.
Cuttlefish switches from full-rank to low-rank training once the stable ranks of all layers have converged.
Our results show that Cuttlefish generates models up to 5.6 times smaller than full-rank models, and attains up to a 1.2 times faster end-to-end training process.
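The switching signal mentioned above can be made concrete with the standard stable-rank formula, ||W||_F^2 / ||W||_2^2; the convergence test below (a relative-change window) is an illustrative stand-in, not necessarily the criterion Cuttlefish uses.

```python
import torch

def stable_rank(w: torch.Tensor) -> float:
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    w2d = w.reshape(w.shape[0], -1)
    fro_sq = w2d.pow(2).sum()
    spec_sq = torch.linalg.matrix_norm(w2d, ord=2) ** 2
    return (fro_sq / spec_sq).item()

def has_converged(history: list, window: int = 5, tol: float = 0.02) -> bool:
    """Illustrative test: relative change of the stable rank over the last `window` steps < tol."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return (max(recent) - min(recent)) / max(recent) < tol

# Track the stable rank of each layer during full-rank training and switch to
# low-rank factors once every layer's estimate has converged.
history = [stable_rank(torch.randn(512, 512)) for _ in range(6)]
print(has_converged(history))
```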
arXiv Detail & Related papers (2023-05-04T04:20:20Z)
- Distributed Pruning Towards Tiny Neural Networks in Federated Learning [12.63559789381064]
FedTiny is a distributed pruning framework for federated learning.
It generates specialized tiny models for memory- and computing-constrained devices.
It achieves an accuracy improvement of 2.61% while significantly reducing the computational cost by 95.91%.
arXiv Detail & Related papers (2022-12-05T01:58:45Z)
- Paoding: Supervised Robustness-preserving Data-free Neural Network Pruning [3.6953655494795776]
We study neural network pruning in the data-free context.
We replace the traditional aggressive one-shot strategy with a conservative one that treats the pruning as a progressive process.
Our method is implemented as a Python package named Paoding and evaluated with a series of experiments on diverse neural network models.
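As a generic sketch of the progressive (rather than one-shot) strategy, the loop below raises sparsity a little each round; the magnitude criterion and linear schedule are assumptions for illustration, not Paoding's robustness-preserving procedure.

```python
import torch

def progressive_magnitude_prune(weight: torch.Tensor, target_sparsity: float, rounds: int) -> torch.Tensor:
    """Reach the target sparsity gradually, zeroing a growing fraction of small weights each round."""
    mask = torch.ones_like(weight, dtype=torch.bool)
    for r in range(1, rounds + 1):
        sparsity_now = target_sparsity * r / rounds                  # linear schedule
        k = int(weight.numel() * sparsity_now)
        threshold = weight.abs().flatten().kthvalue(k).values if k > 0 else -1.0
        mask = weight.abs() > threshold
        # ... in a real pipeline: evaluate the pruned model here and stop early if quality drops ...
    return weight * mask

w = torch.randn(256, 256)
w_pruned = progressive_magnitude_prune(w, target_sparsity=0.8, rounds=8)
print((w_pruned == 0).float().mean())   # ~0.8
```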
arXiv Detail & Related papers (2022-04-02T07:09:17Z)
- ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training [74.43625662170284]
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
arXiv Detail & Related papers (2021-04-21T02:22:10Z)
- A Partial Regularization Method for Network Compression [0.0]
We propose partial regularization, which penalizes only a subset of the parameters rather than all of them (full regularization), to conduct model compression at higher speed.
Experimental results show that, as expected, computational cost is reduced, with lower running time in almost all situations.
Surprisingly, it also improves metrics such as regression fit and classification accuracy in both the training and test phases on multiple datasets.
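A minimal sketch of what partial regularization looks like in practice; the model, subset choice, and L1 penalty below are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Penalize only a chosen subset of parameters (here: the last layer's weights)
# instead of every parameter, as full regularization would.
regularized = [p for name, p in model.named_parameters() if name == "2.weight"]

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss = loss + 1e-4 * sum(p.abs().sum() for p in regularized)   # L1 penalty on the subset only
loss.backward()
```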
arXiv Detail & Related papers (2020-09-03T00:38:27Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.