FuguReport

Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning

Authors Tehila Dahan, Bassel Hamoud, Roie Reshef, Martin Jaggi, Kfir Y. Levy
Affiliations Technion – Israel Institute of Technology / École Polytechnique Fédérale de Lausanne
Categories Method / Distributed Learning / Local MixVR framework, Evaluation / Optimization / SGD baseline performance comparison, Application / Noise Reduction / Reducing local noise in distributed settings
License CC BY 4.0

Abstract Overview

The paper studies distributed stochastic convex optimization under limited communication and introduces Local MixVR, a framework designed to reduce worker drift, the divergence of worker iterates from one another during local updates. The method combines local double-momentum updates, a budget split between local optimization and minibatch averaging, and a synchronization-time drift-correction step. The theoretical analysis argues that these components control local stochastic noise and decouple the required number of communication rounds from the total sample count N. The paper also presents experiments on MNIST and CIFAR-10 that report test accuracy as a function of the number of communication rounds.
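As a rough illustration of the worker drift mentioned above, the following sketch runs plain local SGD (not Local MixVR) on a toy quadratic and measures how far the workers' iterates spread before they are averaged. The objective, the noise model, and all names (noisy_grad, local_steps, drift) are assumptions made for this example and do not come from the paper.

```python
import random

def noisy_grad(x, target, sigma, rng):
    # Stochastic gradient of the toy objective f(x) = 0.5 * ||x - target||^2,
    # perturbed by Gaussian noise (an assumption for this illustration).
    return [xi - ti + rng.gauss(0.0, sigma) for xi, ti in zip(x, target)]

def local_steps(x0, target, num_steps, lr, sigma, rng):
    # One worker runs plain SGD from the shared point x0 without communicating.
    x = list(x0)
    for _ in range(num_steps):
        g = noisy_grad(x, target, sigma, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def average(iterates):
    # Coordinate-wise mean across workers (the synchronization step).
    m = len(iterates)
    return [sum(col) / m for col in zip(*iterates)]

def drift(iterates):
    # Mean squared distance of the worker iterates from their average.
    center = average(iterates)
    total = sum(
        sum((xi - ci) ** 2 for xi, ci in zip(x, center)) for x in iterates
    )
    return total / len(iterates)

if __name__ == "__main__":
    rng = random.Random(0)
    target = [1.0, -2.0, 3.0]
    shared = [0.0, 0.0, 0.0]
    # Print the measured drift for different numbers of local steps.
    for k in (1, 10, 100):
        workers = [local_steps(shared, target, k, 0.05, 1.0, rng) for _ in range(8)]
        print(k, drift(workers))
```

In this toy setting the only source of drift is gradient noise, which is the quantity the summary says the method's three components are meant to control.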

Novelty

The main novelty is a distributed learning framework that the authors claim is the first to eliminate the dependence of communication complexity on the total number of samples N, so that the required number of rounds depends only on the number of workers M. It achieves this by combining three variance-reduction mechanisms: local double-momentum, a hybrid local/minibatch budget allocation, and synchronization-time drift correction.

Results

Theoretical results establish a convergence bound that yields improved communication-round requirements relative to prior methods, particularly in the regime where the number of workers M is bounded by O(N^{1/4}). Empirically, image classification experiments on MNIST and CIFAR-10 demonstrate that Local MixVR outperforms Local SGD, Local Momentum, Minibatch SGD, and Minibatch ASGD across a range of communication-round budgets.
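For readers unfamiliar with the baselines, here is a minimal sketch of two of them, Local SGD and Minibatch SGD, given the same number of communication rounds and the same number of samples per worker per round. It uses a toy quadratic objective rather than MNIST or CIFAR-10, it does not implement Local MixVR, and the function names and parameters are hypothetical choices for this example.

```python
import random

def noisy_grad(x, target, sigma, rng):
    # Noisy gradient of the toy objective f(x) = 0.5 * ||x - target||^2.
    return [xi - ti + rng.gauss(0.0, sigma) for xi, ti in zip(x, target)]

def loss(x, target):
    return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))

def mean(vectors):
    # Coordinate-wise mean of a list of equal-length vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def local_sgd(x0, target, workers, rounds, k, lr, sigma, rng):
    # Each round: every worker takes k local SGD steps from the shared
    # model, then the local models are averaged (one communication).
    x = list(x0)
    for _ in range(rounds):
        local_models = []
        for _ in range(workers):
            y = list(x)
            for _ in range(k):
                g = noisy_grad(y, target, sigma, rng)
                y = [yi - lr * gi for yi, gi in zip(y, g)]
            local_models.append(y)
        x = mean(local_models)
    return x

def minibatch_sgd(x0, target, workers, rounds, k, lr, sigma, rng):
    # Each round: every worker draws k gradients at the shared model,
    # all workers * k gradients are averaged, and one step is taken.
    x = list(x0)
    for _ in range(rounds):
        grads = [noisy_grad(x, target, sigma, rng) for _ in range(workers * k)]
        g = mean(grads)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

if __name__ == "__main__":
    target = [1.0, -2.0, 3.0]
    x0 = [0.0, 0.0, 0.0]
    a = local_sgd(x0, target, 4, 20, 10, 0.1, 0.1, random.Random(0))
    b = minibatch_sgd(x0, target, 4, 20, 10, 0.1, 0.1, random.Random(0))
    print("local sgd loss:", loss(a, target))
    print("minibatch sgd loss:", loss(b, target))
```

Both routines communicate once per round and consume workers * k samples per round; they differ only in whether those samples drive local steps or a single averaged step, which is the trade-off that a hybrid local/minibatch budget would sit between.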

Key Points

  1. Local MixVR addresses worker drift by combining local double-momentum updates, minibatch averaging before synchronization, and an explicit drift-correction mechanism.
  2. The theoretical claim is that the method breaks communication complexity's dependence on the total sample size N, with required rounds scaling only with the number of workers M.
  3. Experiments on MNIST and CIFAR-10 show better test-accuracy-versus-communication-round performance than several distributed SGD baselines over the tested ranges.

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.