Related papers: Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression

Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression

URL: http://arxiv.org/abs/2407.04272v5
Date: Tue, 1 Oct 2024 05:20:59 GMT
Title: Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression
Authors: Hao Feng, Boyuan Zhang, Fanjiang Ye, Min Si, Ching-Hsiang Chu, Jiannan Tian, Chunxing Yin, Summer Deng, Yuchen Hao, Pavan Balaji, Tong Geng, Dingwen Tao,
Abstract summary: DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications. A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices. We introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training.
Score: 10.233937665979694
License: http://creativecommons.org/licenses/by/4.0/
Abstract: DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications. The large size of DLRM models, however, necessitates the use of multiple devices/GPUs for efficient training. A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices. To mitigate this, we introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training. We develop a novel error-bounded lossy compression algorithm, informed by an in-depth analysis of embedding data features, to achieve high compression ratios. Moreover, we introduce a dual-level adaptive strategy for error-bound adjustment, spanning both table-wise and iteration-wise aspects, to balance the compression benefits with the potential impacts on accuracy. We further optimize our compressor for PyTorch tensors on GPUs, minimizing compression overhead. Evaluation shows that our method achieves a 1.38$\times$ training speedup with a minimal accuracy impact.

Related papers

EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation [84.70637613266835]
We re-formulate the model compression problem into the customized compensation problem. We propose Training-free Eigenspace Low-Rank Approximation (EoRA) EoRA directly minimizes compression-induced errors without requiring gradient-based training.
arXiv Detail & Related papers (2024-10-28T17:59:03Z)
LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss. We propose Low-bit Communication Adaptor (LoCo), which compensates local local GPU nodes before, without compromising quality. Experimental results show that across moving large-scale training model frameworks like Megatron-LM and PyTorchs FSDP, LoCo significantly improves compression communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z)
Communication-Efficient Distributed Learning with Local Immediate Error Compensation [95.6828475028581]
We propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm. LIEC-SGD is superior to previous works in either the convergence rate or the communication cost.
arXiv Detail & Related papers (2024-02-19T05:59:09Z)
Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence. We find that gradients require milder compression rates than activations. Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
Long Context Compression with Activation Beacon [22.054232261437186]
Activation Beacon is a plug-in module for transformer-based LLMs. It targets effective, efficient, and flexible compression of long contexts. It achieves a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache.
arXiv Detail & Related papers (2024-01-07T11:57:40Z)
GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training [0.0]
Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model. GraVAC is a framework to dynamically adjust compression factor throughout training by evaluating model progress and assessing information loss associated with compression. As opposed to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16 and LSTM by 4.32x, 1.95x and 6.67x respectively.
arXiv Detail & Related papers (2023-05-20T14:25:17Z)
Optimal Rate Adaption in Federated Learning with Compressed Communications [28.16239232265479]
Federated Learning incurs high communication overhead, which can be greatly alleviated by compression for model updates. tradeoff between compression and model accuracy in the networked environment remains unclear. We present a framework to maximize the final model accuracy by strategically adjusting the compression each iteration.
arXiv Detail & Related papers (2021-12-13T14:26:15Z)
COMET: A Novel Memory-Efficient Deep Learning Training Framework by Using Error-Bounded Lossy Compression [8.080129426746288]
Training wide and deep neural networks (DNNs) require large amounts of storage resources such as memory. We propose a memory-efficient CNN training framework (called COMET) that leverages error-bounded lossy compression. Our framework can significantly reduce the training memory consumption by up to 13.5X over the baseline training and 1.8X over another state-of-the-art compression-based framework.
arXiv Detail & Related papers (2021-11-18T07:43:45Z)
Remote Multilinear Compressive Learning with Adaptive Compression [107.87219371697063]
MultiIoT Compressive Learning (MCL) is an efficient signal acquisition and learning paradigm for multidimensional signals. We propose a novel optimization scheme that enables such a feature for MCL models.
arXiv Detail & Related papers (2021-09-02T19:24:03Z)
Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which joints channel pruning and tensor decomposition to compress CNN models. We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
A Novel Memory-Efficient Deep Learning Training Framework via Error-Bounded Lossy Compression [6.069852296107781]
We propose a memory-driven high performance DNN training framework that leverages error-bounded lossy compression. Our framework can significantly reduce the training memory consumption by up to 13.5x and 1.8x over the baseline training and state-of-the-art framework with compression.
arXiv Detail & Related papers (2020-11-18T00:47:21Z)
PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers. Inspired by the PowerSGD for centralized deep learning, this algorithm uses power steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.