Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized
Language Model Finetuning Using Shared Randomness
- URL: http://arxiv.org/abs/2306.10015v1
- Date: Fri, 16 Jun 2023 17:59:51 GMT
- Title: Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized
Language Model Finetuning Using Shared Randomness
- Authors: Eric Zelikman, Qian Huang, Percy Liang, Nick Haber, Noah D. Goodman
- Abstract summary: Language model training in distributed settings is limited by the communication cost of gradient exchanges.
We extend recent work using shared randomness to perform distributed fine-tuning with low bandwidth.
- Score: 86.61582747039053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language model training in distributed settings is limited by the
communication cost of gradient exchanges. In this short note, we extend recent
work from Malladi et al. (2023), using shared randomness to perform distributed
fine-tuning with low bandwidth. The method is a natural decentralized extension
of memory-efficient Simultaneous Perturbation Stochastic Approximation (SPSA).
At each iteration, each machine seeds a Random Number Generator (RNG) to perform
local reproducible perturbations on model weights and calculate and exchange
scalar projected gradients, which are then used to update each model. By using
a (machine, sample) identifier as the random seed, each model can regenerate
one another's perturbations. As machines only exchange single-byte projected
gradients, this is highly communication-efficient. There are also potential
privacy benefits, as projected gradients may be calculated on different
training data, and the models never access one another's data. Our approach not only
drastically reduces communication bandwidth requirements but also accommodates
dynamic addition or removal of machines during the training process and retains
the memory-efficient and inference-only advantages of recent work. We perform
proof-of-concept experiments to demonstrate the potential usefulness of this
method, building on the rich literature on distributed optimization and
memory-efficient training.
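
To make the protocol concrete, the following is a minimal NumPy sketch of one training loop under the scheme the abstract describes: each machine seeds an RNG with a (machine, sample) identifier, forms a reproducible perturbation, estimates a scalar projected gradient with two-sided SPSA on its local data, and broadcasts only a one-byte encoding of that scalar; every machine then regenerates all perturbations and applies all updates. The toy quadratic loss, the int8 quantizer, and all hyperparameters are illustrative assumptions, not values taken from the paper.

# Minimal proof-of-concept sketch (NumPy) of shared-randomness SPSA finetuning.
# The loss, quantizer, and hyperparameters below are illustrative assumptions.
import numpy as np

DIM, EPS, LR, N_MACHINES, SCALE = 16, 1e-3, 5e-3, 4, 0.05
target = np.random.default_rng(123).normal(size=DIM)

def local_loss(w, machine_id, step):
    # Stand-in for a language-model loss on machine-local data; each machine
    # evaluates only its own batch, identified here by (machine_id, step).
    batch_noise = np.random.default_rng((7, machine_id, step)).normal(scale=0.01, size=DIM)
    return float(np.sum((w - target + batch_noise) ** 2))

def perturbation(machine_id, step):
    # Reproducible perturbation: seeding the RNG with a (machine, sample)
    # identifier lets every machine regenerate every other machine's direction.
    return np.random.default_rng((machine_id, step)).normal(size=DIM)

def quantize(g):
    # Illustrative one-byte (int8) encoding of the scalar projected gradient.
    return np.int8(np.clip(np.round(g / SCALE), -127, 127))

weights = [np.zeros(DIM) for _ in range(N_MACHINES)]  # one model replica per machine

for step in range(300):
    # 1) Each machine estimates a scalar projected gradient via two-sided SPSA.
    messages = []
    for m in range(N_MACHINES):
        z = perturbation(m, step)
        g = (local_loss(weights[m] + EPS * z, m, step)
             - local_loss(weights[m] - EPS * z, m, step)) / (2 * EPS)
        messages.append(quantize(g))  # the only value exchanged: one byte per machine

    # 2) Every machine regenerates all perturbations from the shared seeds and
    #    applies every machine's update, keeping the replicas synchronized.
    for m in range(N_MACHINES):
        for sender in range(N_MACHINES):
            z = perturbation(sender, step)
            weights[m] -= LR * (float(messages[sender]) * SCALE) * z / N_MACHINES

print("loss on machine 0:", local_loss(weights[0], 0, 0))

In the full method, the perturbation is applied in place to the language model's weights and regenerated from the seed rather than stored, which is what preserves the memory-efficient, inference-only character noted above.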
Related papers
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network).
After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
arXiv Detail & Related papers (2024-10-28T13:48:43Z)
- High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions that enable learning a communication-efficient distributed logistic regression model.
In our experiments, we demonstrate a large improvement in accuracy over existing distributed algorithms, with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- SalientGrads: Sparse Models for Communication Efficient and Data Aware Distributed Federated Training [1.0413504599164103]
Federated learning (FL) enables the training of a model leveraging decentralized data in client sites while preserving privacy by not collecting data.
One of the significant challenges of FL is the limited computation and low communication bandwidth of resource-limited edge client nodes.
We propose Salient Grads, which simplifies sparse training by choosing a data-aware subnetwork before training.
arXiv Detail & Related papers (2023-04-15T06:46:37Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated seamlessly with neural networks.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during data parallelism (DP) and model parallelism (MP), respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- PushNet: Efficient and Adaptive Neural Message Passing [1.9121961872220468]
Message passing neural networks have recently evolved into a state-of-the-art approach to representation learning on graphs.
Existing methods perform synchronous message passing along all edges in multiple subsequent rounds.
We consider a novel asynchronous message passing approach where information is pushed only along the most relevant edges until convergence.
arXiv Detail & Related papers (2020-03-04T18:15:30Z)