Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning
Abstract Overview
The paper studies distributed stochastic convex optimization under limited communication and introduces Local MixVR, a framework designed to reduce worker drift during local updates. The method combines local double-momentum updates, a budget split between local optimization and minibatch averaging, and a synchronization-time drift-correction step. The theoretical analysis argues that these components control local stochastic noise and decouple the required communication rounds from the total sample count N. The paper also presents experiments on MNIST and CIFAR-10 comparing communication rounds versus test accuracy.
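To make the notion of worker drift concrete, the following toy sketch runs several workers that each take local heavy-ball momentum steps on a noisy one-dimensional quadratic and then average their iterates. It is not Local MixVR: the objective, the single-momentum update, and every name and default here (`local_round`, `run`, `beta`, `noise_std`, and so on) are assumptions chosen for illustration, and the paper's double-momentum, budget-split, and drift-correction steps are not reproduced. Drift is measured as the largest gap between any worker and the average just before synchronization.

```python
import random


def local_round(x_start, num_workers, local_steps, lr, beta, noise_std, rng):
    """One communication round of local heavy-ball SGD on f(x) = 0.5 * x**2.

    Hypothetical illustration only. Returns the averaged iterate and the
    drift (max gap between any worker and the average before syncing).
    """
    finals = []
    for _ in range(num_workers):
        x, velocity = x_start, 0.0
        for _ in range(local_steps):
            # Noisy gradient of 0.5 * x**2 is x plus Gaussian noise.
            grad = x + rng.gauss(0.0, noise_std)
            velocity = beta * velocity + grad
            x -= lr * velocity
        finals.append(x)
    avg = sum(finals) / num_workers
    drift = max(abs(x - avg) for x in finals)
    return avg, drift


def run(rounds, num_workers=8, local_steps=20, lr=0.05, beta=0.5,
        noise_std=1.0, seed=0):
    """Repeat local rounds, synchronizing by plain averaging each time."""
    rng = random.Random(seed)
    x = 5.0  # assumed starting point, shared by all workers
    history = []
    for _ in range(rounds):
        x, drift = local_round(x, num_workers, local_steps, lr, beta,
                               noise_std, rng)
        history.append((x, drift))
    return history


if __name__ == "__main__":
    for round_idx, (x, drift) in enumerate(run(5), start=1):
        print(f"round {round_idx}: averaged x = {x:+.4f}, drift = {drift:.4f}")
```

With `noise_std=0.0` all workers follow the same trajectory and the measured drift vanishes up to rounding; with noise, workers diverge between synchronizations, which is the effect the three mechanisms described above are meant to control.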
Novelty
The main novelty is a distributed learning framework that is claimed to be the first to eliminate the dependence of communication complexity on the total number of samples N, making it depend only on the number of workers M. It achieves this by combining three variance-reduction mechanisms: local double-momentum, a hybrid local/minibatch budget allocation, and synchronization-time drift correction.
Results
The theoretical results establish a convergence bound under which Local MixVR needs fewer communication rounds than prior methods, particularly in the regime where the number of workers satisfies M = O(N^{1/4}). Empirically, image classification experiments on MNIST and CIFAR-10 show that Local MixVR reaches higher test accuracy than Local SGD, Local Momentum, Minibatch SGD, and Minibatch ASGD across a range of communication-round budgets.
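As a rough feel for how restrictive the M = O(N^{1/4}) regime is, the sketch below computes the largest worker count M with M^4 <= N. It assumes the constant hidden in the O-notation is 1, which the summary does not state, and the helper names `fourth_root_floor` and `in_regime` are hypothetical. Integer arithmetic is used so that exact fourth powers are not affected by floating-point rounding.

```python
import math


def fourth_root_floor(total_samples):
    """Largest integer M with M**4 <= total_samples (assumed constant of 1)."""
    # floor(sqrt(floor(sqrt(N)))) equals floor(N ** (1/4)) for integers N >= 0.
    return math.isqrt(math.isqrt(total_samples))


def in_regime(num_workers, total_samples):
    """Check M <= N ** (1/4) exactly, by comparing M**4 against N."""
    return num_workers ** 4 <= total_samples


if __name__ == "__main__":
    for n in (60_000, 10**6, 10**8):
        print(f"N = {n}: up to {fourth_root_floor(n)} workers")
```

Under this assumption, a total of 10^8 samples admits at most 100 workers, and 60,000 samples admits at most 15, so the regime only accommodates large worker counts when N is very large.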
Key Points
- Local MixVR addresses worker drift by combining local double-momentum updates, minibatch averaging before synchronization, and an explicit drift-correction mechanism.
- The theoretical claim is that the required number of communication rounds no longer depends on the total sample size N and scales only with the number of workers M.
- Experiments on MNIST and CIFAR-10 show better test-accuracy-versus-communication-round performance than several distributed SGD baselines over the tested ranges.