Espresso: Revisiting Gradient Compression from the System Perspective
- URL: http://arxiv.org/abs/2205.14465v1
- Date: Sat, 28 May 2022 15:47:00 GMT
- Title: Espresso: Revisiting Gradient Compression from the System Perspective
- Authors: Zhuang Wang, Haibin Lin, Yibo Zhu, T. S. Eugene Ng
- Abstract summary: Gradient compression (GC) is a promising approach to addressing the communication bottleneck in distributed deep learning (DDL).
However, it is challenging to find the optimal compression strategy for applying GC to DDL because of the intricate interactions among tensors.
Espresso is designed to express all compression strategies and the corresponding interactions among tensors of any DDL training job.
It can improve the training throughput over the state-of-the-art compression-enabled system by up to 77% for representative DDL training jobs.
- Score: 8.535644448611928
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Gradient compression (GC) is a promising approach to addressing the
communication bottleneck in distributed deep learning (DDL). However, it is
challenging to find the optimal compression strategy for applying GC to DDL
because of the intricate interactions among tensors. To fully unleash the
benefits of GC, two questions must be addressed: 1) How to express all
compression strategies and the corresponding interactions among tensors of any
DDL training job? 2) How to quickly select a near-optimal compression strategy?
In this paper, we propose Espresso to answer these questions. It first designs
a decision tree abstraction to express all the compression strategies and
develops empirical models to timeline tensor computation, communication, and
compression to enable Espresso to derive the intricate interactions among
tensors. It then designs a compression decision algorithm that analyzes tensor
interactions to eliminate and prioritize strategies and optimally offloads
compression to CPUs. Experimental evaluations show that Espresso can improve
the training throughput over the state-of-the-art compression-enabled system by
up to 77% for representative DDL training jobs. Moreover, the computational
time needed to select the compression strategy is measured in milliseconds, and
the selected strategy is only a few percent from optimal.
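The listing above contains no code. As a rough illustration of the per-tensor decision the abstract describes, the following Python sketch enumerates a few hypothetical compression strategies (no compression, GPU compression, CPU-offloaded compression) and picks the cheapest one under a toy cost model. All class names, strategy names, and constants are illustrative assumptions; this is not Espresso's actual decision tree, empirical models, or API.

```python
# Hypothetical sketch of per-tensor compression-strategy selection.
# Strategy names, the cost model, and all constants are illustrative
# assumptions, not Espresso's actual decision tree or empirical models.
from dataclasses import dataclass

@dataclass
class CostModel:
    network_gb_per_s: float = 12.5        # assumed inter-node bandwidth
    pcie_gb_per_s: float = 16.0           # assumed GPU<->CPU copy bandwidth
    gpu_compress_gb_per_s: float = 100.0  # assumed GPU compression throughput
    cpu_compress_gb_per_s: float = 10.0   # assumed CPU compression throughput

def estimated_time(strategy: str, tensor_bytes: int, ratio: float, m: CostModel) -> float:
    """Rough per-tensor time (seconds); ratio = compressed size / original size."""
    gb = tensor_bytes / 1e9
    if strategy == "no_compression":
        return gb / m.network_gb_per_s
    if strategy == "gpu_compression":
        return gb / m.gpu_compress_gb_per_s + gb * ratio / m.network_gb_per_s
    if strategy == "cpu_compression":
        # copy to host, compress on the CPU, then send the compressed payload
        return gb / m.pcie_gb_per_s + gb / m.cpu_compress_gb_per_s + gb * ratio / m.network_gb_per_s
    raise ValueError(f"unknown strategy: {strategy}")

def choose_strategy(tensor_bytes: int, ratio: float, m: CostModel) -> str:
    """Pick the cheapest option for a single tensor considered in isolation."""
    options = ("no_compression", "gpu_compression", "cpu_compression")
    return min(options, key=lambda s: estimated_time(s, tensor_bytes, ratio, m))

if __name__ == "__main__":
    m = CostModel()
    for size in (1 << 20, 64 << 20, 512 << 20):  # 1 MiB, 64 MiB, 512 MiB gradients
        print(f"{size >> 20:4d} MiB -> {choose_strategy(size, ratio=0.01, m=m)}")
```

In the actual system, the choice also depends on the interactions among tensors, i.e. how computation, communication, and compression overlap across the whole training timeline, which this per-tensor toy model deliberately ignores.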
Related papers
- Order of Compression: A Systematic and Optimal Sequence to Combinationally Compress CNN [5.25545980258284]
We propose a systematic and optimal sequence to apply multiple compression techniques in the most effective order.
Our proposed Order of Compression significantly reduces computational costs by up to 859 times on ResNet34, with negligible accuracy loss.
We believe our simple yet effective exploration of the order of compression will shed light on the practice of model compression.
arXiv Detail & Related papers (2024-03-26T07:26:00Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference; a minimal, generic Top-k sparsification sketch appears after this list.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- Long Context Compression with Activation Beacon [22.054232261437186]
Activation Beacon is a plug-in module for transformer-based LLMs.
It targets effective, efficient, and flexible compression of long contexts.
It achieves a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache.
arXiv Detail & Related papers (2024-01-07T11:57:40Z)
- Lossy and Lossless (L$^2$) Post-training Model Size Compression [12.926354646945397]
We propose a post-training model size compression method that combines lossy and lossless compression in a unified way.
Our method can achieve a stable $10\times$ compression ratio without sacrificing accuracy and a $20\times$ compression ratio with minor accuracy loss in a short time.
arXiv Detail & Related papers (2023-08-08T14:10:16Z)
- DiffRate: Differentiable Compression Rate for Efficient Vision Transformers [98.33906104846386]
Token compression aims to speed up large-scale vision transformers (e.g. ViTs) by pruning (dropping) or merging tokens.
DiffRate is a novel token compression method with several appealing properties that prior work lacks.
arXiv Detail & Related papers (2023-05-29T10:15:19Z)
- Compressing Neural Networks: Towards Determining the Optimal Layer-wise Decomposition [62.41259783906452]
We present a novel global compression framework for deep neural networks.
It automatically analyzes each layer to identify the optimal per-layer compression ratio.
Our results open up new avenues for future research into the global performance-size trade-offs of modern neural networks.
arXiv Detail & Related papers (2021-07-23T20:01:30Z)
- Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which jointly applies channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit; a minimal sketch of this low-rank scheme appears after this list.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
- On Biased Compression for Distributed Learning [55.89300593805943]
We show for the first time that biased compressors can lead to linear convergence rates in both the single-node and distributed settings.
We propose several new biased compressors with promising theoretical guarantees and practical performance.
arXiv Detail & Related papers (2020-02-27T19:52:24Z)
- Uncertainty Principle for Communication Compression in Distributed and Federated Learning and the Search for an Optimal Compressor [5.09755285351264]
We consider an unbiased compression method inspired by the Kashin representation of vectors, which we call Kashin compression (KC).
KC enjoys a dimension-independent variance bound, for which we derive an explicit formula even in the regime when only a few bits need to be communicated per vector entry.
arXiv Detail & Related papers (2020-02-20T17:20:51Z)
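As referenced from the "Activations and Gradients Compression for Model-Parallel Training" entry above, here is a minimal, generic Top-k sparsification sketch in Python/NumPy. It illustrates the textbook technique only; the function names and shapes are assumptions, and the code is not taken from that paper.

```python
# Generic Top-k gradient sparsification: keep only the k largest-magnitude
# entries and transmit them as (indices, values). Illustrative only; not
# taken from any of the papers listed above.
import numpy as np

def topk_compress(grad: np.ndarray, k: int):
    flat = grad.ravel()
    # indices of the k entries with the largest absolute value
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx], grad.shape

def topk_decompress(idx, values, shape):
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

if __name__ == "__main__":
    g = np.random.randn(4, 256).astype(np.float32)
    idx, vals, shape = topk_compress(g, k=64)   # keep ~6% of the entries
    g_hat = topk_decompress(idx, vals, shape)
    print("kept fraction:", vals.size / g.size)
    print("relative error:", np.linalg.norm(g - g_hat) / np.linalg.norm(g))
```

Top-k is a biased compressor (it systematically drops small entries), which is the setting studied in the "On Biased Compression for Distributed Learning" entry; in practice it is typically paired with error feedback so that the dropped residual is added back to the gradient at the next step.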
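As referenced from the PowerGossip entry above, the sketch below shows the PowerSGD-style low-rank idea that entry builds on: approximate a 2-D tensor M with thin factors obtained from a single power-iteration step, so that only the two thin matrices are communicated. The shapes, the rank, and the warm-started Q are assumptions for illustration; this is not the authors' implementation.

```python
# PowerSGD/PowerGossip-style rank-r compression of a 2-D tensor via one
# power-iteration step. Generic illustration; shapes, rank, and the reuse
# of Q across rounds are assumptions, not the papers' exact algorithm.
import numpy as np

def power_compress(M: np.ndarray, Q: np.ndarray):
    """One power-iteration step: M (n x m), Q (m x r) -> thin factors P, Q_new."""
    P, _ = np.linalg.qr(M @ Q)   # orthonormalize the projected columns (n x r)
    Q_new = M.T @ P              # refresh the right factor (m x r)
    return P, Q_new              # only P and Q_new need to be communicated

def power_decompress(P: np.ndarray, Q_new: np.ndarray) -> np.ndarray:
    return P @ Q_new.T           # rank-r reconstruction of M

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((512, 256)).astype(np.float32)
    Q = rng.standard_normal((256, 4)).astype(np.float32)  # warm-started across rounds in practice
    P, Q_new = power_compress(M, Q)
    M_hat = power_decompress(P, Q_new)
    print("compression ratio:", (P.size + Q_new.size) / M.size)
    print("relative error:", np.linalg.norm(M - M_hat) / np.linalg.norm(M))
```

In the decentralized setting described by the abstract, M would be the parameter difference between neighboring workers, and reusing Q across communication rounds lets repeated power steps progressively refine the low-rank approximation.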