Quantized Adaptive Subgradient Algorithms and Their Applications
- URL: http://arxiv.org/abs/2208.05631v1
- Date: Thu, 11 Aug 2022 04:04:03 GMT
- Title: Quantized Adaptive Subgradient Algorithms and Their Applications
- Authors: Ke Xu, Jianqiao Wangni, Yifan Zhang, Deheng Ye, Jiaxiang Wu and Peilin
Zhao
- Abstract summary: We propose quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual average adaptive subgradient (QRDA adagrad) for distributed training.
A quantized gradient-based adaptive learning rate matrix is constructed to achieve a balance between communication costs, accuracy, and model sparsity.
- Score: 39.103587572626026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data explosion and growing model sizes drive remarkable advances in
large-scale machine learning, but they also make model training time-consuming
and model storage difficult. Distributed model training offers high
computational efficiency and fewer device limitations, yet two main
difficulties remain. On one hand, the communication cost of exchanging
information, e.g., stochastic gradients among different workers, is a key
bottleneck for distributed training efficiency. On the other hand, a model with
fewer parameters is easier to store and communicate, but it risks degraded
performance. To balance communication costs, model capacity, and model
performance simultaneously, we propose quantized composite mirror descent
adaptive subgradient (QCMD adagrad) and quantized regularized dual average
adaptive subgradient (QRDA adagrad) for distributed training. Specifically, we
combine gradient quantization with a sparse model to reduce the communication
cost per iteration in distributed training. A quantized gradient-based adaptive
learning rate matrix is constructed to balance communication costs, accuracy,
and model sparsity. Moreover, we show theoretically that a large quantization
error introduces extra noise, which affects the convergence and sparsity of the
model. Therefore, a threshold quantization strategy with a relatively small
error is adopted in QCMD adagrad and QRDA adagrad to improve the
signal-to-noise ratio and preserve the sparsity of the model. Both theoretical
analyses and empirical results demonstrate the efficacy and efficiency of the
proposed algorithms.
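To make the two key ingredients concrete, the following is a minimal sketch in Python/NumPy of (i) a threshold-style gradient quantizer that zeroes small coordinates and assigns a single shared magnitude to the rest, and (ii) a composite mirror descent AdaGrad step driven by the quantized gradient, where the L1 proximal soft-thresholding step is what yields a sparse model. The threshold rule, scaling, and hyperparameters below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def threshold_quantize(g, ratio=0.5):
    # Zero out coordinates whose magnitude falls below a threshold and map the
    # rest to a single shared magnitude with their original sign.  The
    # threshold rule (ratio * max|g_i|) is an illustrative assumption.
    thr = ratio * np.max(np.abs(g))
    mask = np.abs(g) >= thr
    if not mask.any():
        return np.zeros_like(g)
    scale = np.mean(np.abs(g[mask]))       # shared magnitude of the kept coordinates
    return scale * np.sign(g) * mask       # ternary-style values in {-scale, 0, +scale}

def qcmd_adagrad_step(x, h_sq, q, lr=0.1, l1=1e-3, eps=1e-8):
    # One composite mirror descent AdaGrad step driven by a quantized gradient q.
    # h_sq accumulates squared quantized gradients and defines the diagonal
    # adaptive learning-rate matrix; the coordinate-wise soft-threshold is the
    # proximal operator of the L1 regularizer and is what produces sparsity.
    h_sq = h_sq + q ** 2
    h = eps + np.sqrt(h_sq)                          # diagonal of H_t
    u = x - lr * q / h                               # unregularized adaptive step
    x_new = np.sign(u) * np.maximum(np.abs(u) - lr * l1 / h, 0.0)
    return x_new, h_sq

# Toy usage: each worker would quantize its local gradient before communication,
# and the adaptive sparse update is then applied to the shared model.
rng = np.random.default_rng(0)
target = rng.normal(size=10)
x, h_sq = np.zeros(10), np.zeros(10)
for _ in range(200):
    g = (x - target) + 0.1 * rng.normal(size=10)     # noisy gradient of 0.5*||x - target||^2
    q = threshold_quantize(g)                        # compress before "sending"
    x, h_sq = qcmd_adagrad_step(x, h_sq, q, lr=0.5, l1=0.05)
print("nonzero coordinates after training:", int(np.count_nonzero(x)))
```

In a distributed run, only the quantized gradient (a sign pattern plus one shared scale) would be communicated, which is where the per-iteration savings described in the abstract come from.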
Related papers
- Clipped Uniform Quantizers for Communication-Efficient Federated Learning [3.38220960870904]
This paper introduces an approach to employing clipped uniform quantization in federated learning settings.
By employing optimal clipping thresholds and adaptive quantization schemes, our method significantly curtails the bit requirements for model weight transmissions (a generic sketch of clipped uniform quantization follows this list).
arXiv Detail & Related papers (2024-05-22T05:48:25Z)
- EsaCL: Efficient Continual Learning of Sparse Models [10.227171407348326]
A key challenge in the continual learning setting is to efficiently learn a sequence of tasks without forgetting how to perform previously learned tasks.
We propose a new method for efficient continual learning of sparse models (EsaCL) that can automatically prune redundant parameters without adversely impacting the model's predictive power.
arXiv Detail & Related papers (2024-01-11T04:59:44Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Fundamental Limits of Communication Efficiency for Model Aggregation in Distributed Learning: A Rate-Distortion Approach [54.311495894129585]
We study the limit of the communication cost of model aggregation in distributed learning from a rate-distortion perspective.
It is found that the communication gain from exploiting the correlation between worker nodes is significant for SignSGD.
arXiv Detail & Related papers (2022-06-28T13:10:40Z)
- ClusterQ: Semantic Feature Distribution Alignment for Data-Free Quantization [111.12063632743013]
We propose a new and effective data-free quantization method termed ClusterQ.
To obtain high inter-class separability of semantic features, we cluster and align the feature distribution statistics.
We also incorporate the intra-class variance to solve class-wise mode collapse.
arXiv Detail & Related papers (2022-04-30T06:58:56Z)
- Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning [75.45968495410047]
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning.
Gradient quantization is an effective way of reducing the number of bits required to communicate each model update.
We propose an adaptive quantization strategy called AdaFL that aims to achieve communication efficiency as well as a low error floor.
arXiv Detail & Related papers (2021-02-08T19:14:21Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
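As a companion to the first related paper above, here is a generic sketch of clipped uniform quantization: values are clipped to a symmetric range and rounded onto evenly spaced levels. The clipping range and bit width are arbitrary examples and are not tied to that paper's specific scheme.

```python
import numpy as np

def clipped_uniform_quantize(w, clip, bits=8):
    # Clip values to [-clip, clip], then round them onto 2**bits evenly
    # spaced levels spanning that range; returns the dequantized values.
    levels = 2 ** bits - 1                     # number of quantization steps
    step = 2.0 * clip / levels                 # width of one quantization bin
    clipped = np.clip(w, -clip, clip)
    codes = np.round((clipped + clip) / step)  # integer codes in [0, levels]
    return codes * step - clip                 # map codes back to real values

w = np.random.default_rng(0).normal(size=6)
print(w)
print(clipped_uniform_quantize(w, clip=2.0, bits=4))
```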