Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized
Language Model Finetuning Using Shared Randomness
- URL: http://arxiv.org/abs/2306.10015v1
- Date: Fri, 16 Jun 2023 17:59:51 GMT
- Title: Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized
Language Model Finetuning Using Shared Randomness
- Authors: Eric Zelikman, Qian Huang, Percy Liang, Nick Haber, Noah D. Goodman
- Abstract summary: Language model training in distributed settings is limited by the communication cost of gradient exchanges.
We extend recent work using shared randomness to perform distributed fine-tuning with low bandwidth.
- Score: 86.61582747039053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language model training in distributed settings is limited by the
communication cost of gradient exchanges. In this short note, we extend recent
work from Malladi et al. (2023), using shared randomness to perform distributed
fine-tuning with low bandwidth. The method is a natural decentralized extension
of memory-efficient Simultaneous Perturbation Stochastic Approximation (SPSA).
At each iteration, each machine seeds a Random Number Generator (RNG) to perform
local reproducible perturbations on model weights and calculate and exchange
scalar projected gradients, which are then used to update each model. By using
a (machine, sample) identifier as the random seed, each model can regenerate
one another's perturbations. As machines only exchange single-byte projected
gradients, this is highly communication-efficient. There are also potential
privacy benefits, as projected gradients may be calculated on different
training data, and the models never access one another's data. Our approach not only
drastically reduces communication bandwidth requirements but also accommodates
dynamic addition or removal of machines during the training process and retains
the memory-efficient and inference-only advantages of recent work. We perform
proof-of-concept experiments to demonstrate the potential usefulness of this
method, building on the rich literature on distributed optimization and
memory-efficient training.
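
To make the protocol concrete, the following is a minimal NumPy sketch of one training loop under the scheme the abstract describes: each machine seeds an RNG with a (machine, sample) identifier, forms a reproducible perturbation, estimates a scalar projected gradient with two-sided SPSA on its local data, and broadcasts only a one-byte encoding of that scalar; every machine then regenerates all perturbations and applies all updates. The toy quadratic loss, the int8 quantizer, and all hyperparameters are illustrative assumptions, not values taken from the paper.

# Minimal proof-of-concept sketch (NumPy) of shared-randomness SPSA finetuning.
# The loss, quantizer, and hyperparameters below are illustrative assumptions.
import numpy as np

DIM, EPS, LR, N_MACHINES, SCALE = 16, 1e-3, 5e-3, 4, 0.05
target = np.random.default_rng(123).normal(size=DIM)

def local_loss(w, machine_id, step):
    # Stand-in for a language-model loss on machine-local data; each machine
    # evaluates only its own batch, identified here by (machine_id, step).
    batch_noise = np.random.default_rng((7, machine_id, step)).normal(scale=0.01, size=DIM)
    return float(np.sum((w - target + batch_noise) ** 2))

def perturbation(machine_id, step):
    # Reproducible perturbation: seeding the RNG with a (machine, sample)
    # identifier lets every machine regenerate every other machine's direction.
    return np.random.default_rng((machine_id, step)).normal(size=DIM)

def quantize(g):
    # Illustrative one-byte (int8) encoding of the scalar projected gradient.
    return np.int8(np.clip(np.round(g / SCALE), -127, 127))

weights = [np.zeros(DIM) for _ in range(N_MACHINES)]  # one model replica per machine

for step in range(300):
    # 1) Each machine estimates a scalar projected gradient via two-sided SPSA.
    messages = []
    for m in range(N_MACHINES):
        z = perturbation(m, step)
        g = (local_loss(weights[m] + EPS * z, m, step)
             - local_loss(weights[m] - EPS * z, m, step)) / (2 * EPS)
        messages.append(quantize(g))  # the only value exchanged: one byte per machine

    # 2) Every machine regenerates all perturbations from the shared seeds and
    #    applies every machine's update, keeping the replicas synchronized.
    for m in range(N_MACHINES):
        for sender in range(N_MACHINES):
            z = perturbation(sender, step)
            weights[m] -= LR * (float(messages[sender]) * SCALE) * z / N_MACHINES

print("loss on machine 0:", local_loss(weights[0], 0, 0))

In the full method, the perturbation is applied in place to the language model's weights and regenerated from the seed rather than stored, which is what preserves the memory-efficient, inference-only character noted above.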
Related papers
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network).
After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
arXiv Detail & Related papers (2024-10-28T13:48:43Z)
- High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions that enable learning a communication-efficient distributed logistic regression model.
In our experiments, we demonstrate a large improvement in accuracy over existing distributed algorithms, with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- SalientGrads: Sparse Models for Communication Efficient and Data Aware Distributed Federated Training [1.0413504599164103]
Federated learning (FL) enables the training of a model leveraging decentralized data in client sites while preserving privacy by not collecting data.
One of the significant challenges of FL is the limited computation and low communication bandwidth of resource-limited edge client nodes.
We propose Salient Grads, which simplifies sparse training by choosing a data-aware subnetwork before training.
arXiv Detail & Related papers (2023-04-15T06:46:37Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated seamlessly with neural networks.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during data parallelism (DP) and model parallelism (MP), respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- PushNet: Efficient and Adaptive Neural Message Passing [1.9121961872220468]
Message passing neural networks have recently evolved into a state-of-the-art approach to representation learning on graphs.
Existing methods perform synchronous message passing along all edges in multiple subsequent rounds.
We consider a novel asynchronous message passing approach where information is pushed only along the most relevant edges until convergence.
arXiv Detail & Related papers (2020-03-04T18:15:30Z)