AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models
- URL: http://arxiv.org/abs/2508.18182v1
- Date: Mon, 25 Aug 2025 16:35:57 GMT
- Title: AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models
- Authors: Nikolay Kutuzov, Makar Baderko, Stepan Kulibaba, Artem Dzhalilov, Daniel Bobrov, Maxim Mashtaler, Alexander Gasnikov
- Abstract summary: We propose a three-stage method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo, and a switch mode mechanism. MIT allows individual nodes to run multiple lightweight training streams with different model instances in parallel. Adaptive Batched DiLoCo dynamically adjusts local batch sizes to balance computation and communication. Switch mode seamlessly introduces gradient accumulation once adaptive batch sizes grow beyond hardware-friendly limits.
- Score: 34.54482364155804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling distributed training of Large Language Models (LLMs) requires not only algorithmic advances but also efficient utilization of heterogeneous hardware resources. While existing methods such as DiLoCo have demonstrated promising results, they often fail to fully exploit computational clusters under dynamic workloads. To address this limitation, we propose a three-stage method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo, and a switch mode mechanism. MIT allows individual nodes to run multiple lightweight training streams with different model instances in parallel and merge them to combine knowledge, increasing throughput and reducing idle time. Adaptive Batched DiLoCo dynamically adjusts local batch sizes to balance computation and communication, substantially lowering synchronization delays. Switch mode further stabilizes training by seamlessly introducing gradient accumulation once adaptive batch sizes grow beyond hardware-friendly limits. Together, these innovations improve both convergence speed and system efficiency. We also provide a theoretical estimate of the number of communications required for the full convergence of a model trained using our method.
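The adaptive batching and switch mode behaviour described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical illustration rather than the authors' implementation: the batch-size growth rule, the `hardware_limit` threshold that triggers switch mode, and the reduction of DiLoCo's outer synchronization to plain parameter averaging are all simplifying assumptions.

```python
# Hypothetical sketch of Adaptive Batched DiLoCo with switch mode (not the authors' code).
import copy
import torch
import torch.nn as nn

def local_phase(model, data, target, batch_size, hardware_limit, inner_steps, lr=1e-2):
    """Run `inner_steps` local updates; once the adaptive batch size exceeds the
    hardware-friendly limit, switch mode emulates it via gradient accumulation."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(inner_steps):
        opt.zero_grad()
        if batch_size <= hardware_limit:
            idx = torch.randint(0, data.size(0), (batch_size,))
            loss_fn(model(data[idx]), target[idx]).backward()
        else:  # switch mode: accumulate gradients over hardware-sized micro-batches
            n_micro = (batch_size + hardware_limit - 1) // hardware_limit
            for _ in range(n_micro):
                idx = torch.randint(0, data.size(0), (hardware_limit,))
                (loss_fn(model(data[idx]), target[idx]) / n_micro).backward()
        opt.step()

def average_models(models):
    """DiLoCo-style outer synchronization, reduced here to plain parameter averaging."""
    with torch.no_grad():
        for params in zip(*(m.parameters() for m in models)):
            mean = torch.stack([p.data for p in params]).mean(dim=0)
            for p in params:
                p.data.copy_(mean)

torch.manual_seed(0)
data, target = torch.randn(512, 16), torch.randn(512, 1)
base = nn.Linear(16, 1)
nodes = [copy.deepcopy(base) for _ in range(4)]  # one model replica per node

batch_size, hardware_limit, growth_factor = 32, 128, 2
for outer_round in range(5):
    for model in nodes:                      # in practice these run in parallel
        local_phase(model, data, target, batch_size, hardware_limit, inner_steps=8)
    average_models(nodes)                    # communication happens only here
    batch_size *= growth_factor              # assumed adaptive growth rule
```

The point of the sketch is that inter-node communication happens only at the outer averaging step, while switch mode lets the effective batch size keep growing past what a single forward/backward pass can hold.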
Related papers
- CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z) - NoLoCo: No-all-reduce Low Communication Training Method for Large Models [0.310688583550805]
Training large language models is generally done via optimization methods on clusters containing tens of thousands of accelerators. NoLoCo implicitly synchronizes model weights via a novel variant of Nesterov momentum, partially averaging model weights with those of a randomly selected peer. Our method requires significantly less communication overhead than fully sharded data-parallel training, or even the widely used low-communication training method DiLoCo (a minimal sketch of this partial-averaging idea appears after this list).
arXiv Detail & Related papers (2025-06-12T17:23:23Z) - Efficient Multi-modal Long Context Learning for Training-free Adaptation [96.21248144937627]
This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC). It embeds demonstration examples directly into the model input. It condenses long-context multimodal inputs into compact, task-specific memory representations.
arXiv Detail & Related papers (2025-05-26T10:49:44Z) - AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z) - High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions that enable us to learn a communication-efficient distributed logistic regression model.
In our experiments we demonstrate a large improvement in accuracy over distributed algorithms, with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z) - Communication-Efficient Distributed Deep Learning via Federated Dynamic Averaging [1.4748100900619232]
Federated Dynamic Averaging (FDA) is a communication-efficient distributed deep learning (DDL) strategy.
FDA reduces communication cost by orders of magnitude compared to both traditional and cutting-edge algorithms.
arXiv Detail & Related papers (2024-05-31T16:34:11Z) - One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments [43.107261545706415]
Large Language Models (LLMs) have advanced rapidly but face significant memory demands.
Current methods typically require lengthy training to alleviate the performance degradation from quantization loss.
We make an initial attempt to extend the once-for-all framework to large language models.
arXiv Detail & Related papers (2024-05-30T16:05:15Z) - LoCoDL: Communication-Efficient Distributed Learning with Local Training and Compression [56.01900711954956]
We introduce LoCoDL, a communication-efficient algorithm that leverages the two popular and effective techniques of Local training, which reduces the communication frequency, and Compression, in which short bitstreams are sent instead of full-dimensional vectors of floats. LoCoDL provably benefits from local training and compression and enjoys a doubly-accelerated communication complexity, with respect to the condition number of the functions and the model dimension, in the general heterogeneous regime with strongly convex functions.
arXiv Detail & Related papers (2024-03-07T09:22:50Z) - Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices [0.0]
Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters.
We have framed our asynchronous SGD loss function as a block structured optimization problem with delayed updates.
arXiv Detail & Related papers (2024-01-03T13:07:07Z) - Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT)-based intelligent and ubiquitous computing.
For fast-growing applications and data volumes, distributed learning is a promising emerging paradigm, since it is often impractical or inefficient to share or aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z) - Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning [75.45968495410047]
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning.
Gradient quantization is an effective way of reducing the number of bits required to communicate each model update.
We propose an adaptive quantization strategy called AdaFL that aims to achieve communication efficiency as well as a low error floor (a generic gradient-quantization sketch appears after this list).
arXiv Detail & Related papers (2021-02-08T19:14:21Z)
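As referenced in the NoLoCo entry above, the following is a minimal sketch of partial weight averaging with a randomly selected peer. The mixing coefficient `alpha` and the random pairing scheme are assumptions, and NoLoCo's Nesterov-momentum variant is omitted.

```python
# Hypothetical sketch of gossip-style partial averaging with a random peer (not NoLoCo's code).
import random
import torch
import torch.nn as nn

def partial_average(model_a, model_b, alpha=0.5):
    """Mix each parameter of model_a toward the corresponding parameter of model_b."""
    with torch.no_grad():
        for pa, pb in zip(model_a.parameters(), model_b.parameters()):
            pa.data.mul_(1.0 - alpha).add_(pb.data, alpha=alpha)

workers = [nn.Linear(8, 1) for _ in range(6)]
for step in range(10):
    # ... each worker would take its own local optimizer step here ...
    i = random.randrange(len(workers))
    j = random.choice([k for k in range(len(workers)) if k != i])
    partial_average(workers[i], workers[j])  # no all-reduce: only one pair communicates
```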
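As referenced in the AdaFL entry above, the sketch below illustrates the general idea of gradient quantization, i.e. communicating each model update with far fewer bits. It is a generic uniform quantizer, not AdaFL's adaptive strategy, and the fixed `num_bits` value is an assumption.

```python
# Hypothetical sketch of uniform gradient quantization for communication (not AdaFL's method).
import torch

def quantize(grad, num_bits=4):
    """Map a float gradient to integer levels plus (offset, scale); far fewer bits to send."""
    levels = 2 ** num_bits - 1
    lo, hi = grad.min(), grad.max()
    scale = (hi - lo) / levels if hi > lo else torch.tensor(1.0)
    q = torch.round((grad - lo) / scale).to(torch.uint8)  # fits in 8 bits for num_bits <= 8
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.to(torch.float32) * scale + lo

g = torch.randn(1000)                    # a stand-in for one model update
q, lo, scale = quantize(g, num_bits=4)
g_hat = dequantize(q, lo, scale)
print("max quantization error:", (g - g_hat).abs().max().item())
```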
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.