Mask-Encoded Sparsification: Mitigating Biased Gradients in Communication-Efficient Split Learning
- URL: http://arxiv.org/abs/2408.13787v3
- Date: Fri, 27 Sep 2024 03:07:05 GMT
- Title: Mask-Encoded Sparsification: Mitigating Biased Gradients in Communication-Efficient Split Learning
- Authors: Wenxuan Zhou, Zhihao Qu, Shen-Huan Lyu, Miao Cai, Baoliu Ye
- Abstract summary: This paper introduces a novel framework designed to achieve a high compression ratio in Split Learning (SL) scenarios.
Our investigations demonstrate that compressing feature maps within SL leads to biased gradients that can negatively impact the convergence rates.
We employ a narrow bit-width encoded mask to compensate for the sparsification error without increasing the order of time complexity.
- Score: 15.78336840511033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel framework designed to achieve a high compression ratio in Split Learning (SL) scenarios where resource-constrained devices are involved in large-scale model training. Our investigations demonstrate that compressing feature maps within SL leads to biased gradients that can negatively impact the convergence rates and diminish the generalization capabilities of the resulting models. Our theoretical analysis provides insights into how compression errors critically hinder SL performance, which previous methodologies underestimate. To address these challenges, we employ a narrow bit-width encoded mask to compensate for the sparsification error without increasing the order of time complexity. Supported by rigorous theoretical analysis, our framework significantly reduces compression errors and accelerates the convergence. Extensive experiments also verify that our method outperforms existing solutions regarding training efficiency and communication complexity.
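The core idea of the abstract can be sketched roughly as follows (a minimal illustration, not the authors' implementation: the function names, the uniform residual quantizer, and the 2-bit default are assumptions). Top-k sparsification drops the small entries of a feature map, which is exactly what biases the gradients; a narrow bit-width mask can carry a coarse encoding of the dropped residual so the receiver can correct most of that error:

```python
import numpy as np

def sparsify_with_mask(x, k, mask_bits=2):
    """Keep the k largest-magnitude entries of x exactly and encode the
    dropped residual with a narrow (mask_bits-wide) uniform quantizer."""
    flat = x.ravel()
    topk = np.argpartition(np.abs(flat), -k)[-k:]   # indices of top-k magnitudes
    sparse = np.zeros_like(flat)
    sparse[topk] = flat[topk]
    residual = flat - sparse                        # error introduced by dropping
    levels = 2 ** mask_bits - 1
    lo, hi = residual.min(), residual.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    mask = np.round((residual - lo) / scale).astype(np.uint8)  # narrow bit-width mask
    return sparse.reshape(x.shape), mask, lo, scale

def reconstruct(sparse, mask, lo, scale):
    """Receiver side: add the coarsely dequantized residual back."""
    return sparse + (mask * scale + lo).reshape(sparse.shape)
```

Without the mask the receiver sees only the sparse tensor and gradients computed through it are biased; the mask shrinks that bias at a cost of `mask_bits` per entry rather than a full-precision float.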
Related papers
- Content-decoupled Contrastive Learning-based Implicit Degradation Modeling for Blind Image Super-Resolution [33.16889233975723]
Implicit degradation modeling-based blind super-resolution (SR) has attracted increasing attention in the community.
We propose a new Content-decoupled Contrastive Learning-based blind image super-resolution (CdCL) framework.
arXiv Detail & Related papers (2024-08-10T04:51:43Z)
- Generalized Nested Latent Variable Models for Lossy Coding applied to Wind Turbine Scenarios [14.48369551534582]
A learning-based approach seeks to minimize the compromise between compression rate and reconstructed image quality.
A successful technique consists in introducing a deep hyperprior that operates within a 2-level nested latent variable model.
This paper extends this concept by designing a generalized L-level nested generative model with a Markov chain structure.
arXiv Detail & Related papers (2024-06-10T11:00:26Z)
- Adv-KD: Adversarial Knowledge Distillation for Faster Diffusion Sampling [2.91204440475204]
Diffusion Probabilistic Models (DPMs) have emerged as a powerful class of deep generative models.
They rely on sequential denoising steps during sample generation.
We propose a novel method that integrates denoising phases directly into the model's architecture.
arXiv Detail & Related papers (2024-05-31T08:19:44Z)
- Improved Quantization Strategies for Managing Heavy-tailed Gradients in Distributed Learning [20.91559450517002]
It is observed that gradient distributions are heavy-tailed, with outliers significantly influencing the design of compression strategies.
Existing parameter quantization methods experience performance degradation when this heavy-tailed feature is ignored.
We introduce a novel compression scheme specifically engineered for heavy-tailed gradients, which effectively combines truncation with quantization.
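A minimal sketch of the truncation-plus-quantization idea (the parameter names and the uniform quantizer are assumptions for illustration; the paper's scheme may, for instance, choose the clipping threshold adaptively):

```python
import numpy as np

def truncate_then_quantize(g, clip, bits=4):
    """Clip heavy-tailed outliers into [-clip, clip], then quantize
    uniformly with a (2**bits - 1)-step codebook."""
    levels = 2 ** bits - 1
    clipped = np.clip(g, -clip, clip)              # truncation tames outliers
    scale = 2 * clip / levels
    codes = np.round((clipped + clip) / scale).astype(np.uint8)
    return codes, scale

def dequantize(codes, scale, clip):
    return codes * scale - clip
```

Without truncation the step size would be dictated by the largest outlier, wasting the codebook on rare values; clipping trades a small bias on outliers for much finer resolution on the bulk of the distribution.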
arXiv Detail & Related papers (2024-02-02T06:14:31Z)
- EControl: Fast Distributed Optimization with Compression and Error Control [8.624830915051021]
We propose EControl, a novel mechanism that can regulate the strength of the feedback signal.
We show that EControl mitigates the shortcomings of a naive implementation of our method and support our findings with numerical experiments.
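The feedback-strength idea can be sketched as damped error feedback (an illustrative assumption: `beta` and the simple top-k compressor below stand in for EControl's actual control rule):

```python
import numpy as np

def topk(v, k):
    """Simple top-k compressor, used only for illustration."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ec_step(grad, error, k, beta):
    """One compressed step: inject a beta-scaled share of the
    accumulated error, compress, and keep the new residual."""
    corrected = grad + beta * error
    sent = topk(corrected, k)
    return sent, corrected - sent   # transmitted update, new error memory
```

Setting `beta = 1` recovers classical error feedback; `beta < 1` weakens the feedback signal.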
arXiv Detail & Related papers (2023-11-06T10:00:13Z)
- ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training [74.43625662170284]
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
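One way to exploit that similarity is sketched below, under the assumption that one learner's local top-k index set is reused by all learners (the function name and leader-selection scheme are illustrative, not ScaleCom's exact specification):

```python
import numpy as np

def shared_topk_aggregate(grads, k, leader):
    """Compress every learner's gradient with the index set chosen by
    one leader's local top-k; each learner then sends only k values."""
    idx = np.argpartition(np.abs(grads[leader]), -k)[-k:]
    mean = np.zeros_like(grads[0])
    mean[idx] = np.mean([g[idx] for g in grads], axis=0)
    return mean
```

Rotating the leader role across steps would spread the index-selection bias over the learners.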
arXiv Detail & Related papers (2021-04-21T02:22:10Z)
- Step-Ahead Error Feedback for Distributed Training with Compressed Gradient [99.42912552638168]
We show that a new "gradient mismatch" problem is raised by the local error feedback in centralized distributed training.
We propose two novel techniques, 1) step ahead and 2) error averaging, with rigorous theoretical analysis.
arXiv Detail & Related papers (2020-08-13T11:21:07Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Accelerated Convergence for Counterfactual Learning to Rank [65.63997193915257]
We show that the convergence rate of SGD approaches with IPS-weighted gradients suffers from the large variance introduced by the IPS weights.
We propose a novel learning algorithm, called CounterSample, that has provably better convergence than standard IPS-weighted gradient descent methods.
We prove that CounterSample converges faster and complement our theoretical findings with empirical results.
arXiv Detail & Related papers (2020-05-21T12:53:36Z)
- Compressing Large Sample Data for Discriminant Analysis [78.12073412066698]
We consider the computational issues due to large sample size within the discriminant analysis framework.
We propose a new compression approach for reducing the number of training samples for linear and quadratic discriminant analysis.
arXiv Detail & Related papers (2020-05-08T05:09:08Z)
- Structured Sparsification with Joint Optimization of Group Convolution and Channel Shuffle [117.95823660228537]
We propose a novel structured sparsification method for efficient network compression.
The proposed method automatically induces structured sparsity on the convolutional weights.
We also address the problem of inter-group communication with a learnable channel shuffle mechanism.
arXiv Detail & Related papers (2020-02-19T12:03:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.