Rethinking the initialization of Momentum in Federated Learning with Heterogeneous Data
- URL: http://arxiv.org/abs/2411.19798v1
- Date: Fri, 29 Nov 2024 16:00:52 GMT
- Title: Rethinking the initialization of Momentum in Federated Learning with Heterogeneous Data
- Authors: Chenguang Xiao, Shuo Wang
- Abstract summary: In this work, we propose a new way to calculate the estimated momentum used in local initialization.
The proposed method is named Reversed Momentum Federated Learning (RMFL).
- Score: 5.922172844641853
- License:
- Abstract: Data heterogeneity is a major challenge to Federated Learning performance. Recently, momentum-based optimization techniques have been shown to be effective in mitigating the heterogeneity issue. Along with the model updates, the momentum updates are transmitted to the server side and aggregated. Therefore, the local training initialized with a global momentum is guided by the global history of the gradients. However, we identify a problem in the traditional accumulation of the momentum, which is suboptimal in Federated Learning systems. The momentum places less weight on historical gradients and more on recent gradients. This, however, engages more biased local gradients towards the end of the local training. In this work, we propose a new way to calculate the estimated momentum used in local initialization. The proposed method is named Reversed Momentum Federated Learning (RMFL). The key idea is to assign exponentially decayed weights to the gradients as time goes forward, which is the opposite of the traditional momentum accumulation. The effectiveness of RMFL is evaluated on three popular benchmark datasets with different heterogeneity levels.
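The abstract does not spell out the exact update rule, but the contrast it draws can be sketched numerically. Below is a minimal NumPy sketch comparing the usual exponential-moving-average momentum (where recent gradients weigh most) with a reversed weighting in which the weights decay exponentially as local training proceeds. The function names, the normalization, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def traditional_momentum(grads, beta=0.9):
    """Standard exponential moving average of gradients.

    With T local gradients g_1..g_T, g_t effectively receives a weight
    proportional to beta**(T - t): recent gradients dominate, older
    ones decay.
    """
    m = np.zeros_like(grads[0])
    for g in grads:
        m = beta * m + (1.0 - beta) * g
    return m


def reversed_momentum(grads, beta=0.9):
    """Hypothetical sketch of the reversed weighting described for RMFL.

    Weights decay exponentially as time moves forward: g_t receives a
    weight proportional to beta**(t - 1), so the earliest local
    gradients dominate the estimate. The normalization to sum to one
    is an assumption.
    """
    T = len(grads)
    weights = np.array([beta ** t for t in range(T)])
    weights /= weights.sum()
    return sum(w * g for w, g in zip(weights, grads))


# Toy usage: five fake local gradients for a 3-parameter model.
rng = np.random.default_rng(0)
local_grads = [rng.normal(size=3) for _ in range(5)]
print("traditional:", traditional_momentum(local_grads))
print("reversed:   ", reversed_momentum(local_grads))
```

Under the reversed weighting, the earliest local gradients, which are computed closest to the globally initialized model, dominate the estimate; this matches the abstract's motivation that gradients later in local training are more biased by the client's local data.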
Related papers
- Normalization and effective learning rates in reinforcement learning [52.59508428613934]
Normalization layers have recently experienced a renaissance in the deep reinforcement learning and continual learning literature.
We show that normalization brings with it a subtle but important side effect: an equivalence between growth in the norm of the network parameters and decay in the effective learning rate.
We propose to make the learning rate schedule explicit with a simple re-parameterization which we call Normalize-and-Project.
arXiv Detail & Related papers (2024-07-01T20:58:01Z) - Aggregation Weighting of Federated Learning via Generalization Bound
Estimation [65.8630966842025]
Federated Learning (FL) typically aggregates client model parameters using a weighting approach determined by sample proportions.
We replace the aforementioned weighting method with a new strategy that considers the generalization bounds of each local model.
arXiv Detail & Related papers (2023-11-10T08:50:28Z) - The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise.
Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z) - Understanding How Consistency Works in Federated Learning via Stage-wise
Relaxed Initialization [84.42306265220274]
Federated learning (FL) is a distributed paradigm that coordinates massive local clients to collaboratively train a global model.
Previous works have implicitly studied that FL suffers from the "client drift" problem, which is caused by the inconsistent optimum across local clients.
To alleviate the negative impact of the "client drift" and explore its substance in FL, we first design an efficient FL algorithm, FedInit.
arXiv Detail & Related papers (2023-06-09T06:55:15Z) - RIOT: Recursive Inertial Odometry Transformer for Localisation from
Low-Cost IMU Measurements [5.770538064283154]
We present two end-to-end frameworks for pose invariant deep inertial odometry that utilise self-attention to capture both spatial features and long-range dependencies in inertial data.
We evaluate our approaches against a custom 2-layer Gated Recurrent Unit, trained in the same manner on the same data, and tested each approach on a number of different users, devices and activities.
arXiv Detail & Related papers (2023-03-03T00:20:01Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Towards understanding how momentum improves generalization in deep
learning [44.441873298005326]
We show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems.
A key insight in our analysis is that momentum is beneficial in datasets where the examples share some feature but differ in their margin.
arXiv Detail & Related papers (2022-07-13T02:39:08Z) - Accelerate Distributed Stochastic Descent for Nonconvex Optimization
with Momentum [12.324457683544132]
We propose a momentum method for such model averaging approaches.
We analyze the convergence and scaling properties of such momentum methods.
Our experimental results show that block momentum not only accelerates training, but also achieves better results.
arXiv Detail & Related papers (2021-10-01T19:23:18Z) - Correcting Momentum in Temporal Difference Learning [95.62766731469671]
We argue that momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale.
We show that this phenomenon exists, and then propose a first-order correction term to momentum.
An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.
arXiv Detail & Related papers (2021-06-07T20:41:15Z) - Reconciling Modern Deep Learning with Traditional Optimization Analyses:
The Intrinsic Learning Rate [36.83448475700536]
Recent works suggest that the use of Batch Normalization in today's deep learning can move it far from a traditional optimization viewpoint.
This paper highlights other ways in which behavior of normalized nets departs from traditional viewpoints.
We name it the Fast Equilibrium Conjecture and suggest it holds the key to why Batch Normalization is effective.
arXiv Detail & Related papers (2020-10-06T17:58:29Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)