Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism
- URL: http://arxiv.org/abs/2010.08899v2
- Date: Fri, 21 May 2021 08:23:19 GMT
- Title: Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism
- Authors: Vipul Gupta, Dhruv Choudhary, Ping Tak Peter Tang, Xiaohan Wei, Xing
Wang, Yuzhen Huang, Arun Kejariwal, Kannan Ramchandran, Michael W. Mahoney
- Abstract summary: We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
- Score: 56.78673028601739
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider hybrid parallelism -- a paradigm that employs both
Data Parallelism (DP) and Model Parallelism (MP) -- to scale distributed
training of large recommendation models. We propose a compression framework
called Dynamic Communication Thresholding (DCT) for communication-efficient
hybrid training. DCT filters the entities to be communicated across the network
through a simple hard-thresholding function, allowing only the most relevant
information to pass through. For communication-efficient DP, DCT compresses the
parameter gradients sent to the parameter server during model synchronization.
The threshold is updated only once every few thousand iterations to reduce the
computational overhead of compression. For communication-efficient MP, DCT
incorporates a novel technique to compress the activations and gradients sent
across the network during the forward and backward propagation, respectively.
This is done by identifying and updating only the most relevant neurons of the
neural network for each training sample in the data. We evaluate DCT on
publicly available natural language processing and recommender models and
datasets, as well as recommendation systems used in production at Facebook. DCT
reduces communication by at least $100\times$ and $20\times$ during DP and MP,
respectively. The algorithm has been deployed in production, and it improves
end-to-end training time for a state-of-the-art industrial recommender model by
37\%, without any loss in performance.
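To make the DP side concrete, here is a minimal sketch of hard-thresholding gradient compression with a periodically refreshed threshold, in the spirit of DCT-DP as described above. The class name, kept fraction, and refresh interval are illustrative assumptions, not the paper's exact implementation.

```python
import math

import torch


class GradientThresholder:
    """Hard-thresholding gradient compressor, in the spirit of DCT-DP.

    Entries whose magnitude falls below the threshold are dropped; the
    threshold is recomputed only every `refresh_every` steps so that
    compression itself stays cheap. All hyperparameters are assumed.
    """

    def __init__(self, keep_fraction: float = 0.01, refresh_every: int = 1000):
        self.keep_fraction = keep_fraction  # fraction of entries kept
        self.refresh_every = refresh_every  # steps between threshold refreshes
        self.threshold = None
        self.step = 0

    def compress(self, grad: torch.Tensor):
        """Return the (indices, values) that survive hard thresholding."""
        flat = grad.flatten()
        if self.threshold is None or self.step % self.refresh_every == 0:
            # Pick the magnitude cutoff that keeps roughly `keep_fraction`
            # of the entries; this is the only expensive operation, and it
            # runs once every few thousand iterations.
            self.threshold = torch.quantile(flat.abs(), 1.0 - self.keep_fraction)
        self.step += 1
        mask = flat.abs() >= self.threshold
        indices = mask.nonzero(as_tuple=True)[0]
        return indices, flat[indices]

    @staticmethod
    def decompress(indices, values, shape) -> torch.Tensor:
        """Rebuild a dense gradient from the sparse pair on the receiver."""
        flat = torch.zeros(math.prod(shape), dtype=values.dtype,
                           device=values.device)
        flat[indices] = values
        return flat.reshape(shape)
```

At roughly 1% density, only the surviving (index, value) pairs travel to the parameter server during synchronization, which is the mechanism behind the order-of-$100\times$ DP compression reported above (index overhead aside).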
Related papers
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training large models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train the individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
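As a rough illustration (not the paper's actual label-generation scheme, which is its core contribution), the sketch below trains two segments independently against given synthetic intermediate targets, so no activations or gradients ever cross the partition:

```python
import torch
import torch.nn as nn


def train_segments(seg1: nn.Module, seg2: nn.Module, loader, synth_targets):
    """Train two model segments independently: seg1 regresses onto fixed
    synthetic intermediate targets, seg2 classifies from those same targets.
    Since neither segment back-propagates through the other, no tensors
    cross the GPU boundary mid-batch. How the synthetic targets are
    generated is assumed given here."""
    opt1 = torch.optim.SGD(seg1.parameters(), lr=0.01)
    opt2 = torch.optim.SGD(seg2.parameters(), lr=0.01)
    mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()
    for (x, y), z in zip(loader, synth_targets):  # z: synthetic intermediate label
        opt1.zero_grad()
        mse(seg1(x), z).backward()  # segment 1: input -> synthetic target
        opt1.step()
        opt2.zero_grad()
        ce(seg2(z), y).backward()   # segment 2: synthetic target -> class label
        opt2.step()
```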
arXiv Detail & Related papers (2024-03-17T13:06:29Z)
- Communication-Efficient Federated Learning through Adaptive Weight Clustering and Server-Side Distillation [10.541541376305245]
Federated Learning (FL) is a promising technique for the collaborative training of deep neural networks across multiple devices.
FL is hindered by excessive communication costs due to repeated server-client communication during training.
We propose FedCompress, a novel approach that combines dynamic weight clustering and server-side knowledge distillation.
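The clustering half can be pictured with a plain 1-D k-means weight quantizer like the sketch below; this is a generic stand-in, not FedCompress's actual dynamic clustering, and all names and defaults are assumptions.

```python
import numpy as np


def cluster_weights(w: np.ndarray, n_clusters: int = 16, iters: int = 10):
    """Quantize a weight tensor to `n_clusters` shared values via 1-D k-means.

    Only the small centroid table plus per-weight cluster indices would be
    communicated, instead of full-precision weights."""
    flat = w.ravel()
    # Spread initial centroids evenly over the observed weight range.
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(iters):
        # Assign every weight to its nearest centroid ...
        assign = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # ... then move each centroid to the mean of its members.
        for k in range(n_clusters):
            members = flat[assign == k]
            if members.size:
                centroids[k] = members.mean()
    return centroids, assign.astype(np.uint8).reshape(w.shape)


def reconstruct(centroids: np.ndarray, assignments: np.ndarray) -> np.ndarray:
    """Receiver side: rebuild approximate weights from the compact form."""
    return centroids[assignments]
```

With 16 clusters, each weight reduces in principle to a 4-bit index (stored as uint8 in this sketch) plus a shared 16-entry centroid table, a much smaller payload than float32 even before distillation enters the picture.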
arXiv Detail & Related papers (2024-01-25T14:49:15Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for reducing the communication and memory costs of distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized Language Model Finetuning Using Shared Randomness [86.61582747039053]
Language model training in distributed settings is limited by the communication cost of gradient exchanges.
We extend recent work using shared randomness to perform distributed fine-tuning with low bandwidth.
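A toy version of the shared-randomness idea: every worker derives the same random direction from a common seed, so only the scalar projection of the local gradient needs to be sent (the paper further quantizes it to roughly one byte). The function names and the plain rank-one reconstruction are simplifying assumptions.

```python
import numpy as np


def project_gradient(local_grad: np.ndarray, seed: int) -> float:
    """Worker side: project the local gradient onto a random direction
    derived from a seed shared by all workers. Only this one scalar needs
    to be communicated."""
    rng = np.random.default_rng(seed)
    direction = rng.standard_normal(local_grad.shape)
    return float((local_grad * direction).sum())


def apply_projected_step(params: np.ndarray, coeff: float, seed: int,
                         lr: float = 1e-3) -> np.ndarray:
    """Receiver side: regenerate the identical direction from the shared
    seed and rebuild a rank-one estimate of the sender's gradient."""
    rng = np.random.default_rng(seed)
    direction = rng.standard_normal(params.shape)
    return params - lr * coeff * direction
```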
arXiv Detail & Related papers (2023-06-16T17:59:51Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- Prompt Tuning for Parameter-efficient Medical Image Segmentation [79.09285179181225]
We propose and investigate several contributions to achieve a parameter-efficient but effective adaptation for semantic segmentation on two medical imaging datasets.
We pre-train this architecture with a dedicated dense self-supervision scheme based on assignments to online generated prototypes.
We demonstrate that the resulting neural network model is able to narrow the gap between fully fine-tuned and parameter-efficiently adapted models.
arXiv Detail & Related papers (2022-11-16T21:55:05Z)
- ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
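One way to picture progressive training is a schedule that activates an ever-deeper prefix of the model as rounds proceed, so early rounds compute and communicate only a fraction of the parameters. The linear growth rule below is an assumption, and the small auxiliary head ProgFed attaches to the prefix to produce a loss is omitted.

```python
import torch.nn as nn


def progressive_schedule(blocks, total_rounds: int, grow_frac: float = 0.75):
    """Yield, for each federated round, the prefix of the model that is
    active in that round. Depth grows linearly over the first `grow_frac`
    of rounds, then the full model trains; shallower prefixes mean fewer
    parameters to compute and to communicate both ways."""
    n = len(blocks)
    grow_rounds = max(int(total_rounds * grow_frac), 1)
    for rnd in range(total_rounds):
        if rnd < grow_rounds:
            # Linear growth from a single block to the full depth.
            depth = 1 + rnd * (n - 1) // max(grow_rounds - 1, 1)
        else:
            depth = n
        yield rnd, nn.Sequential(*blocks[:depth])
```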
arXiv Detail & Related papers (2021-10-11T14:45:00Z)
- Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks [13.552262050816616]
Kronecker-Factored Approximate Curvature (KFAC) is one of the most efficient approximation algorithms for training deep models.
Yet, when leveraging GPU clusters to train models with KFAC, it incurs extensive computation and introduces extra communication during each iteration.
We propose D-KFAC with smart parallelism of computing and communication tasks to reduce the iteration time.
arXiv Detail & Related papers (2021-07-14T08:01:07Z)
- Sparse-Push: Communication- & Energy-Efficient Decentralized Distributed Learning over Directed & Time-Varying Graphs with non-IID Datasets [2.518955020930418]
We propose Sparse-Push, a communication-efficient decentralized distributed training algorithm.
The proposed algorithm enables a 466x reduction in communication with only a 1% degradation in performance.
We also demonstrate how communication compression can lead to significant performance degradation in the case of non-IID datasets.
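For intuition, the savings come from gossiping only a small top-k slice of each update to neighbors, along the lines of the sketch below; the actual algorithm builds on push-sum over directed, time-varying graphs, with those details omitted here.

```python
import numpy as np


def sparsify_topk(update: np.ndarray, keep_fraction: float = 0.002):
    """Keep only the largest-magnitude entries of a model update before
    gossiping it to neighbors; at ~0.2% density the payload shrinks by a
    few hundred times, ignoring index overhead."""
    flat = update.ravel()
    k = max(1, int(flat.size * keep_fraction))
    # argpartition finds the k largest |entries| without a full sort.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]
```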
arXiv Detail & Related papers (2021-02-10T19:41:11Z)
- Coded Federated Learning [5.375775284252717]
Federated learning is a method of training a global model from decentralized data distributed across client devices.
Our results show that Coded Federated Learning (CFL) allows the global model to converge nearly four times faster than an uncoded approach.
arXiv Detail & Related papers (2020-02-21T23:06:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.