Novel Gradient Sparsification Algorithm via Bayesian Inference
- URL: http://arxiv.org/abs/2409.14893v1
- Date: Mon, 23 Sep 2024 10:42:34 GMT
- Title: Novel Gradient Sparsification Algorithm via Bayesian Inference
- Authors: Ali Bereyhi, Ben Liang, Gary Boudreau, Ali Afana
- Abstract summary: This paper proposes a novel sparsification algorithm called regularized Top-$k$ (RegTop-$k$) that controls the learning rate scaling of error accumulation.
Numerical experiments with ResNet-18 on CIFAR-10 show that, at $0.1\%$ sparsification, RegTop-$k$ achieves about $8\%$ higher accuracy than standard Top-$k$.
- Score: 27.246907664193156
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Error accumulation is an essential component of the Top-$k$ sparsification method in distributed gradient descent. It implicitly scales the learning rate and prevents the slow-down of lateral movement, but it can also deteriorate convergence. This paper proposes a novel sparsification algorithm called regularized Top-$k$ (RegTop-$k$) that controls the learning rate scaling of error accumulation. The algorithm is developed by looking at the gradient sparsification as an inference problem and determining a Bayesian optimal sparsification mask via maximum-a-posteriori estimation. It utilizes past aggregated gradients to evaluate posterior statistics, based on which it prioritizes the local gradient entries. Numerical experiments with ResNet-18 on CIFAR-10 show that at $0.1\%$ sparsification, RegTop-$k$ achieves about $8\%$ higher accuracy than standard Top-$k$.
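For illustration, below is a minimal NumPy sketch of the Top-$k$-with-error-accumulation baseline that the abstract describes. It is not the paper's RegTop-$k$ algorithm (the MAP-based mask built from posterior statistics of past aggregated gradients is omitted), and the toy quadratic objective, sparsity budget, and learning rate are assumptions made for the example.
```python
# Minimal sketch of Top-k sparsification with error accumulation (error
# feedback) in single-worker form. RegTop-k's Bayesian/MAP mask is NOT
# implemented; k, the learning rate, and the quadratic toy objective are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, k, lr, steps = 100, 10, 0.1, 200            # keep 10% of the gradient entries

A = rng.standard_normal((d, d)) / np.sqrt(d)
Q = A.T @ A + np.eye(d)                        # toy quadratic objective 0.5 * x^T Q x
x = rng.standard_normal(d)
residual = np.zeros(d)                         # accumulated (unsent) gradient error

def top_k_mask(v, k):
    """Boolean mask selecting the k largest-magnitude entries of v."""
    idx = np.argpartition(np.abs(v), -k)[-k:]
    mask = np.zeros(v.shape, dtype=bool)
    mask[idx] = True
    return mask

for t in range(steps):
    grad = Q @ x                               # full local gradient
    corrected = grad + residual                # add back previously dropped entries
    mask = top_k_mask(corrected, k)
    sparse_grad = np.where(mask, corrected, 0.0)   # what a worker would transmit
    residual = corrected - sparse_grad         # error accumulation for the next round
    x -= lr * sparse_grad                      # descent step with the sparse gradient

print("final objective:", 0.5 * x @ Q @ x)
```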
Related papers
- Modified Step Size for Enhanced Stochastic Gradient Descent: Convergence
and Experiments [0.0]
This paper introduces a novel approach to enhancing the performance of the stochastic gradient descent (SGD) algorithm by incorporating a modified decay step size based on $\frac{1}{\sqrt{t}}$.
The proposed step size integrates a logarithmic term, leading to smaller step-size values in the final iterations.
To demonstrate the effectiveness of our approach, we conducted numerical experiments on image classification tasks using the FashionMNIST and ARAR datasets.
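The exact step-size formula is not given in this summary; as a rough illustration only, the sketch below contrasts a plain $\frac{1}{\sqrt{t}}$ decay with an assumed variant that adds a logarithmic damping term so that later iterations use smaller steps.
```python
# Illustration only: the exact formula is not in the summary. Contrast a plain
# 1/sqrt(t) decay with an assumed log-damped variant that shrinks faster in
# later iterations.
import math

def sqrt_decay(lr0, t):
    return lr0 / math.sqrt(t)

def log_damped_decay(lr0, t):
    # hypothetical form: extra log(t) damping on top of 1/sqrt(t)
    return lr0 / (math.sqrt(t) * math.log(t + math.e))

for t in (1, 10, 100, 1000):
    print(t, round(sqrt_decay(0.1, t), 5), round(log_damped_decay(0.1, t), 5))
```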
arXiv Detail & Related papers (2023-09-03T19:21:59Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
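As a rough illustration of the forward-gradient idea (the directional derivative times the perturbation direction is an unbiased estimate of the gradient), here is a minimal NumPy sketch on a least-squares toy problem; the perturbation is applied in weight space for simplicity, whereas the paper's variance reduction comes from perturbing activations and using local losses.
```python
# Minimal forward-gradient sketch on least squares: the estimate (grad . u) * u
# with random direction u is unbiased for the true gradient. Weight-space
# perturbation is used for simplicity; the paper perturbs activations instead.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

resid = X @ w - y
true_grad = 2 * X.T @ resid / n                 # exact gradient of mean squared error

estimates = []
for _ in range(20000):
    u = rng.standard_normal(d)                  # random perturbation direction
    jvp = 2 * (X @ u) @ resid / n               # directional derivative, no full gradient needed
    estimates.append(jvp * u)                   # forward-gradient estimate

print("true gradient    :", np.round(true_grad, 2))
print("forward-grad mean:", np.round(np.mean(estimates, axis=0), 2))
```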
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - SIMPLE: A Gradient Estimator for $k$-Subset Sampling [42.38652558807518]
In this work, we fall back to discrete $k$-subset sampling on the forward pass.
We show that our gradient estimator, SIMPLE, exhibits lower bias and variance compared to state-of-the-art estimators.
Empirical results show improved performance on learning to explain and sparse linear regression.
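The summary keeps discrete $k$-subset sampling on the forward pass with a lower-variance backward estimator; SIMPLE's exact estimator is not reproduced here. As a generic point of reference only, the sketch below shows the common straight-through approach for hard top-$k$ selection, with all toy quantities assumed.
```python
# Generic straight-through baseline for learning k-subset selection (NOT the
# SIMPLE estimator): hard top-k forward pass, gradient passed straight to the
# scores on the backward pass. All quantities here are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, k, lr = 8, 3, 0.1
theta = rng.standard_normal(n)                  # learnable selection scores
values = rng.standard_normal(n)                 # downstream payoff of each item

def k_hot(scores, k):
    z = np.zeros_like(scores)
    z[np.argpartition(scores, -k)[-k:]] = 1.0   # discrete k-hot selection
    return z

for step in range(100):
    z = k_hot(theta, k)                         # discrete forward pass
    loss = -(z * values).sum()                  # toy objective: pick large values
    grad_z = -values                            # d loss / d z
    theta -= lr * grad_z                        # straight-through: treat dz/dtheta as identity

print("selected:", np.sort(np.nonzero(k_hot(theta, k))[0]))
print("best k  :", np.sort(np.argsort(values)[-k:]))
```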
arXiv Detail & Related papers (2022-10-04T22:33:16Z) - Momentum-Based Policy Gradient with Second-Order Information [40.51117836892182]
We propose a variance-reduced policy-gradient method, called SHARP, which incorporates second-order information into gradient descent.
Unlike most previous work, our proposed algorithm does not require importance sampling, which can compromise the advantage of the variance-reduction process.
Our extensive experimental evaluations show the effectiveness of the proposed algorithm on various control tasks and its advantage over the state of the art in practice.
arXiv Detail & Related papers (2022-05-17T11:56:50Z) - Fast Gradient Non-sign Methods [67.56549792690706]
Fast Gradient Non-sign Method (FGNM) is a general routine that can seamlessly replace the conventional sign operation in gradient-based attacks.
Our methods outperform their sign-based counterparts by up to 27.5% and by 9.5% on average.
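The precise FGNM scaling is not given in this summary; purely as an illustration of dropping the sign operation, the sketch below compares a sign step with a raw-gradient step rescaled (by assumption) to match the L2 norm of the sign vector, showing that the non-sign step preserves the gradient direction.
```python
# Illustration of dropping the sign operation in a gradient-based attack step.
# The rescaling below (matching the L2 norm of the sign vector) is an assumed
# choice, not necessarily FGNM's exact formula; the toy gradient is random.
import numpy as np

rng = np.random.default_rng(0)
eps = 8 / 255
g = rng.standard_normal(32 * 32)               # stand-in for d(loss)/d(input)

sign_step = eps * np.sign(g)                                              # conventional sign step
nonsign_step = eps * g * np.linalg.norm(np.sign(g)) / np.linalg.norm(g)   # keeps the true direction

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cosine(sign step, gradient)    :", round(cosine(sign_step, g), 3))
print("cosine(non-sign step, gradient):", round(cosine(nonsign_step, g), 3))
```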
arXiv Detail & Related papers (2021-10-25T08:46:00Z) - Gradient Boosted Binary Histogram Ensemble for Large-scale Regression [60.16351608335641]
We propose a gradient boosting algorithm for large-scale regression problems called Gradient Boosted Binary Histogram Ensemble (GBBHE), based on binary histogram partition and ensemble learning.
In the experiments, compared with other state-of-the-art algorithms such as gradient boosted regression tree (GBRT), our GBBHE algorithm shows promising performance with less running time on large-scale datasets.
arXiv Detail & Related papers (2021-06-03T17:05:40Z) - Stochastic Gradient Variance Reduction by Solving a Filtering Problem [0.951828574518325]
Deep neural networks (DNN) are typically optimized using stochastic gradient descent (SGD).
The estimation of the gradient using samples tends to be noisy and unreliable, resulting in large gradient variance and bad convergence.
We propose Filter Gradient Descent (FGD), an efficient optimization algorithm that makes a consistent estimation of the gradient by solving a filtering problem.
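The paper's specific filter is not described in this summary; as a generic stand-in, the sketch below smooths noisy stochastic gradients with an exponential moving-average (low-pass) filter and compares the estimation error against the raw noisy gradients.
```python
# Generic stand-in for gradient filtering (not the paper's specific filter):
# smooth noisy stochastic gradients with an exponential moving average and
# compare the error of the filtered estimate against the raw noisy gradients.
import numpy as np

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0, 0.5])         # fixed "true" gradient for the demo
beta, steps = 0.9, 500
g_filt = np.zeros_like(true_grad)
raw_errs, filt_errs = [], []

for t in range(steps):
    g_noisy = true_grad + rng.standard_normal(3)          # noisy stochastic gradient
    g_filt = beta * g_filt + (1 - beta) * g_noisy         # low-pass (EMA) filter
    raw_errs.append(np.linalg.norm(g_noisy - true_grad))
    filt_errs.append(np.linalg.norm(g_filt - true_grad))

# skip the first 50 steps so the filter has warmed up
print("mean error, raw gradients     :", round(float(np.mean(raw_errs[50:])), 3))
print("mean error, filtered gradients:", round(float(np.mean(filt_errs[50:])), 3))
```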
arXiv Detail & Related papers (2020-12-22T23:48:42Z) - Reparametrizing gradient descent [0.0]
We propose an optimization algorithm which we call norm-adapted gradient descent.
Our algorithm can also be compared to quasi-Newton methods, but we seek roots rather than stationary points.
arXiv Detail & Related papers (2020-10-09T20:22:29Z) - Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
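The random-top-k operator is only named in this summary; the sketch below shows one plausible reading (an assumption, not the paper's exact operator): spend part of the sparsity budget on the largest-magnitude entries and the rest on uniformly sampled coordinates.
```python
# One plausible reading of a random-top-k sparsifier (an assumption, not
# necessarily the paper's exact operator): keep part of the budget for the
# largest-magnitude entries and sample the rest uniformly at random.
import numpy as np

def random_top_k(v, k, top_fraction=0.5, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    k_top = int(k * top_fraction)
    order = np.argsort(np.abs(v))[::-1]          # indices sorted by magnitude, descending
    top_idx = order[:k_top]                      # deterministic top entries
    rand_idx = rng.choice(order[k_top:], size=k - k_top, replace=False)  # random remainder
    out = np.zeros_like(v)
    keep = np.concatenate([top_idx, rand_idx])
    out[keep] = v[keep]
    return out

v = np.random.default_rng(0).standard_normal(20)
print(np.round(random_top_k(v, k=6, rng=np.random.default_rng(1)), 2))
```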
arXiv Detail & Related papers (2020-01-27T08:23:58Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
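For context on the optimistic updates underlying the analyzed algorithm, the sketch below runs the basic optimistic gradient descent-ascent step on a toy bilinear min-max problem; the Adagrad-style adaptive step sizes studied in the paper are omitted.
```python
# Basic optimistic gradient descent-ascent on the bilinear game min_x max_y x*y;
# the adaptive (Adagrad-style) step sizes analyzed in the paper are omitted.
eta, steps = 0.2, 1000
x, y = 1.0, 1.0
gx_prev, gy_prev = 0.0, 0.0                    # previous gradients for the optimistic term

for t in range(steps):
    gx, gy = y, x                              # grad_x(x*y) = y, grad_y(x*y) = x
    x, y = x - eta * (2 * gx - gx_prev), y + eta * (2 * gy - gy_prev)
    gx_prev, gy_prev = gx, gy

print("x, y after optimistic updates:", round(x, 6), round(y, 6))  # both approach the saddle point (0, 0)
```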
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.