Related papers: DiPaCo: Distributed Path Composition

DiPaCo: Distributed Path Composition

URL: http://arxiv.org/abs/2403.10616v1
Date: Fri, 15 Mar 2024 18:26:51 GMT
Title: DiPaCo: Distributed Path Composition
Authors: Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Adhiguna Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc'Aurelio Ranzato, Jiajun Shen, Arthur Szlam,
Abstract summary: We propose a co-designed modular architecture and training approach for machine learning models. During training, DiPaCo distributes by paths through a set of shared modules. At inference time, only a single path needs to be executed for each input, without the need for model compression.
Score: 31.686642863608558
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Progress in machine learning (ML) has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high bandwidth communication between devices working in parallel. In this work, we propose a co-designed modular architecture and training approach for ML models, dubbed DIstributed PAth COmposition (DiPaCo). During training, DiPaCo distributes computation by paths through a set of shared modules. Together with a Local-SGD inspired optimization (DiLoCo) that keeps modules in sync with drastically reduced communication, Our approach facilitates training across poorly connected and heterogeneous workers, with a design that ensures robustness to worker failures and preemptions. At inference time, only a single path needs to be executed for each input, without the need for any model compression. We consider this approach as a first prototype towards a new paradigm of large-scale learning, one that is less synchronous and more modular. Our experiments on the widely used C4 benchmark show that, for the same amount of training steps but less wall-clock time, DiPaCo exceeds the performance of a 1 billion-parameter dense transformer language model by choosing one of 256 possible paths, each with a size of 150 million parameters.

Related papers

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention.<n>On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z)
MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core [11.40633051522406]
We introduce an end-to-end training framework for large-scale MoE models. MoE Parallel Folding is a novel strategy that decouples the parallelization of attention and MoE in Transformer models. A flexible token-level dispatcher supports both token-dropping and token-dropless MoE training.
arXiv Detail & Related papers (2025-04-21T08:39:47Z)
AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single- GPU and multi- GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE. Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules [15.680276212483292]
We propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks. Parm achieves 1.13$times$ to 5.77$times$ speedup on 1296 manually configured MoE layers and approximately 3$times$ improvement on two real-world MoE models.
arXiv Detail & Related papers (2024-06-30T05:55:11Z)
Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts [4.629608387540524]
ScMoE is a novel shortcut-connected MoE architecture integrated with an overlapping parallelization strategy.<n>Compared to the prevalent top-2 MoE baseline, ScMoE achieves speedups of 1.49 times in training and 1.82 times in inference.
arXiv Detail & Related papers (2024-04-07T17:17:23Z)
ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
atom is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting. atom aims to accommodate a complete LLM on one host (peer) through seamlessly model swapping and concurrently trains multiple copies across various peers to optimize training throughput. Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, atom can enhance training efficiency up to $20 times$ when juxtaposed with the state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints. We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B. We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception. Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency. We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism. We implement and evaluate three common classes of compression algorithms. We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
Deep Model Assembling [31.88606253639418]
This paper studies a divide-and-conquer strategy to train large models. It divides a large model into smaller modules, training them independently, and reassembling the trained modules to obtain the target model. We introduce a global, shared meta model to implicitly link all the modules together. This enables us to train highly compatible modules that collaborate effectively when they are assembled together.
arXiv Detail & Related papers (2022-12-08T08:04:06Z)
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models. System supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z)
Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods. We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods. Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turning-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.