Online Model Compression for Federated Learning with Large Models
- URL: http://arxiv.org/abs/2205.03494v1
- Date: Fri, 6 May 2022 22:43:03 GMT
- Title: Online Model Compression for Federated Learning with Large Models
- Authors: Tien-Ju Yang, Yonghui Xiao, Giovanni Motta, Françoise Beaufays,
Rajiv Mathews, Mingqing Chen
- Abstract summary: Online Model Compression (OMC) is a framework that stores model parameters in a compressed format and decompresses them only when needed.
OMC can reduce the memory usage and communication cost of model parameters by up to 59% while attaining accuracy and training speed comparable to those of full-precision training.
- Score: 8.48327410170884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the challenges of training large neural network models
under federated learning settings: high on-device memory usage and
communication cost. The proposed Online Model Compression (OMC) provides a
framework that stores model parameters in a compressed format and decompresses
them only when needed. We use quantization as the compression method in this
paper and propose three methods to minimize the impact on model accuracy: (1)
per-variable transformation, (2) weight-matrices-only quantization, and (3)
partial parameter quantization. According to our experiments on two recent
neural networks for speech recognition and two different datasets, OMC can
reduce the memory usage and communication cost of model parameters by up to
59% while attaining accuracy and training speed comparable to those of
full-precision training.
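
As an illustration of the mechanism described in the abstract, the sketch below keeps a parameter tensor stored in quantized form and dequantizes it only when it is used, with a per-variable scale and offset. This is a minimal sketch of the idea, not the authors' implementation: the class name `QuantizedVariable`, the 8-bit affine scheme, and the update flow are assumptions. Applying such a wrapper only to weight matrices, or only to a subset of variables, would correspond to the weight-matrices-only and partial-parameter-quantization variants mentioned above.

```python
import numpy as np

class QuantizedVariable:
    """Stores a tensor as uint8 plus a per-variable scale/offset and
    decompresses it only when the tensor is actually used."""

    def __init__(self, weights: np.ndarray, num_bits: int = 8):
        self.num_bits = num_bits
        self._compress(weights)

    def _compress(self, weights: np.ndarray) -> None:
        # Per-variable transformation: one scale and offset per variable,
        # chosen from that variable's own value range.
        w_min, w_max = float(weights.min()), float(weights.max())
        levels = 2 ** self.num_bits - 1
        self.scale = (w_max - w_min) / levels if w_max > w_min else 1.0
        self.offset = w_min
        q = np.round((weights - self.offset) / self.scale)
        self.q_weights = q.astype(np.uint8)  # compressed storage

    def decompress(self) -> np.ndarray:
        # Materialize the full-precision tensor only when it is needed,
        # e.g. right before a layer's forward/backward computation.
        return self.q_weights.astype(np.float32) * self.scale + self.offset

    def update(self, new_weights: np.ndarray) -> None:
        # After a local training step, re-compress immediately so that at
        # most one full-precision copy of the variable exists at a time.
        self._compress(new_weights)


# Usage: decompress for one training step, then store the result compressed.
w = np.random.randn(256, 256).astype(np.float32)
var = QuantizedVariable(w)
w_live = var.decompress()           # full precision, used for the step
w_live -= 0.01 * np.sign(w_live)    # stand-in for a gradient update
var.update(w_live)
```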
Related papers
- Variational autoencoder-based neural network model compression [4.992476489874941]
Variational Autoencoders (VAEs), as a form of deep generative model, have been widely used in recent years.
This paper aims to explore a neural network model compression method based on VAEs.
arXiv Detail & Related papers (2024-08-25T09:06:22Z)
- MCNC: Manifold Constrained Network Compression [21.70510507535041]
We present MCNC, a novel model compression method that constrains the parameter space to a low-dimensional, pre-defined, and frozen nonlinear manifold.
We show that our method, MCNC, significantly outperforms state-of-the-art baselines in terms of compression, accuracy, and/or model reconstruction time.
arXiv Detail & Related papers (2024-06-27T16:17:26Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- Retraining-free Model Quantization via One-Shot Weight-Coupling Learning [41.299675080384]
Mixed-precision quantization (MPQ) is advocated for compressing models effectively by allocating heterogeneous bit-widths across layers.
MPQ is typically organized as a two-stage search-then-retrain process.
In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression.
arXiv Detail & Related papers (2024-01-03T05:26:57Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers to reduce the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Deep Hierarchy Quantization Compression algorithm based on Dynamic Sampling [11.439540966972212]
Federated machine learning stores data locally for training and aggregates the models on the server.
During the training process, the transmission of model parameters can impose a significant load on the network bandwidth.
We propose a deep hierarchical quantization compression algorithm, which further compresses the model and reduces the network load brought by data transmission.
arXiv Detail & Related papers (2022-12-30T15:12:30Z)
- Deep learning model compression using network sensitivity and gradients [3.52359746858894]
We present model compression algorithms for both non-retraining and retraining conditions.
In the first case, we propose the Bin & Quant algorithm for compression of the deep learning models using the sensitivity of the network parameters.
In the second case, we propose our novel gradient-weighted k-means clustering algorithm (GWK).
arXiv Detail & Related papers (2022-10-11T03:02:40Z)
- Dynamically-Scaled Deep Canonical Correlation Analysis [77.34726150561087]
Canonical Correlation Analysis (CCA) extracts features from two views by finding maximally correlated linear projections of them.
We introduce a novel dynamic scaling method for training an input-dependent canonical correlation model.
arXiv Detail & Related papers (2022-03-23T12:52:49Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but they also incur huge computation costs.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
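
The last entry above mentions Quantization Aware Training with the Straight-Through Estimator (STE). A minimal PyTorch sketch of that standard setup is shown below; the class name `STEQuantize` and the 8-bit affine fake-quantization are illustrative assumptions, and the paper itself extends the idea beyond int8 with extreme compression methods such as quantization noise.

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Fake-quantizes a weight tensor in the forward pass; the backward pass
    uses the Straight-Through Estimator, treating round() as the identity."""

    @staticmethod
    def forward(ctx, w, num_bits: int = 8):
        levels = 2 ** num_bits - 1
        w_min, w_max = w.min(), w.max()
        scale = (w_max - w_min).clamp(min=1e-8) / levels
        q = torch.round((w - w_min) / scale)
        return q * scale + w_min        # dequantized ("fake-quantized") weights

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass the gradient straight through the rounding step.
        return grad_output, None


# Usage in a training step: quantize on the fly, keep full-precision
# master weights that the optimizer actually updates.
w = torch.randn(64, 64, requires_grad=True)
x = torch.randn(8, 64)
loss = (x @ STEQuantize.apply(w, 8)).sum()
loss.backward()                         # gradients reach w via the STE
print(w.grad.shape)                     # torch.Size([64, 64])
```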
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.