Placement Semantics for Distributed Deep Learning: A Systematic Framework for Analyzing Parallelism Strategies
- URL: http://arxiv.org/abs/2601.02311v1
- Date: Mon, 05 Jan 2026 18:01:38 GMT
- Title: Placement Semantics for Distributed Deep Learning: A Systematic Framework for Analyzing Parallelism Strategies
- Authors: Deep Pankajbhai Mehta
- Abstract summary: Training large language models requires distributing computation across many accelerators. No unified systematic framework predicts their behavior. We introduce placement semantics: each strategy is specified by how it places four training states across devices.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large language models requires distributing computation across many accelerators, yet practitioners select parallelism strategies (data, tensor, pipeline, ZeRO) through trial and error because no unified systematic framework predicts their behavior. We introduce placement semantics: each strategy is specified by how it places four training states (parameters, optimizer, gradients, activations) across devices using five modes (replicated, sharded, sharded-with-gather, materialized, offloaded). From placement alone, without implementation details, we derive memory consumption and communication volume. Our predictions match published results exactly: ZeRO-3 uses 8x less memory than data parallelism at 1.5x communication cost, as reported in the original paper. We prove two conditions (gradient integrity, state consistency) are necessary and sufficient for distributed training to match single-device results, and provide composition rules for combining strategies safely. The framework unifies ZeRO Stages 1-3, Fully Sharded Data Parallel (FSDP), tensor parallelism, and pipeline parallelism as instances with different placement choices.
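The abstract's core idea, deriving per-device memory from placement alone, can be sketched in a few lines. This is my own illustrative simplification, not the paper's formalism: the mode names follow the abstract, but the byte counts (a mixed-precision setup with fp16 parameters and gradients plus fp32 Adam states, and activations omitted) and the helper `per_device_bytes` are assumptions for the example.

```python
# Sketch of "placement semantics": the memory one device holds follows
# directly from the mode assigned to each training state.

def per_device_bytes(state_bytes: dict, placement: dict, n_devices: int) -> int:
    """Memory one device holds, given a placement mode per state.

    Modes (from the abstract): 'replicated' keeps a full copy on every
    device; 'sharded' and 'sharded-with-gather' keep 1/N of the state.
    ('materialized' and 'offloaded' are omitted from this sketch.)
    """
    total = 0
    for state, nbytes in state_bytes.items():
        mode = placement[state]
        if mode == "replicated":
            total += nbytes
        elif mode in ("sharded", "sharded-with-gather"):
            total += nbytes // n_devices
        else:
            raise ValueError(f"mode {mode!r} not modeled in this sketch")
    return total

# Illustrative mixed-precision sizes for a model with P parameters:
# fp16 params (2P), fp16 grads (2P), fp32 Adam states (12P).
P = 1_000_000_000
states = {"params": 2 * P, "grads": 2 * P, "optimizer": 12 * P}

data_parallel = {s: "replicated" for s in states}
zero3 = {"params": "sharded-with-gather", "grads": "sharded",
         "optimizer": "sharded"}

N = 64
dp_mem = per_device_bytes(states, data_parallel, N)
z3_mem = per_device_bytes(states, zero3, N)
print(dp_mem / z3_mem)  # ratio grows linearly with N; here 64.0
```

Under this toy model, fully sharding all three states reduces per-device memory by a factor of N, which is why the savings the abstract cites depend on the device count and the states chosen for sharding.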
Related papers
- Learning to Share: Selective Memory for Efficient Parallel Agentic Systems [49.78267008828593]
Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. Recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. We propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks.
arXiv Detail & Related papers (2026-02-05T18:20:21Z) - Plan before Solving: Problem-Aware Strategy Routing for Mathematical Reasoning with LLMs [49.995906301946]
Existing methods usually leverage a fixed strategy to guide Large Language Models (LLMs) to perform mathematical reasoning. Our analysis reveals that a single strategy cannot adapt to problem-specific requirements and thus overlooks the trade-off between effectiveness and efficiency. We propose Planning and Routing through Instance-Specific Modeling (PRISM), a novel framework that decouples mathematical reasoning into two stages: strategy planning and targeted execution.
arXiv Detail & Related papers (2025-09-29T07:22:41Z) - Imbalanced Regression Pipeline Recommendation [5.863538874435322]
This work proposes the Meta-learning for Imbalanced Regression (Meta-IR) framework. It trains meta-classifiers to recommend, per task, the best pipeline composed of a resampling strategy and a learning model. Compared with AutoML frameworks, Meta-IR obtained better results.
arXiv Detail & Related papers (2025-07-16T04:34:02Z) - Model Parallelism With Subnetwork Data Parallelism [21.914077370806016]
Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into structured subnetworks trained across workers without exchanging activations. We study two complementary masking regimes: backward masking, which applies sparsity only in the backward step to retain unbiased gradients, and forward masking, which also removes parameters in the forward pass to deliver stronger efficiency gains.
arXiv Detail & Related papers (2025-07-11T21:25:11Z) - ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs [22.542224045868117]
We introduce ByteScale, an efficient framework for large-scale mixed training of long and short sequences. ByteScale is based on Hybrid Data Parallelism (HDP), which unifies inter- and intra-data partitioning with a dynamic mesh design. Experiment results show that ByteScale outperforms the state-of-the-art training system by up to 7.89x.
arXiv Detail & Related papers (2025-02-28T17:01:03Z) - Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator [4.953653137620666]
In large language model (LLM) training, several parallelization strategies, including Tensor Parallelism (TP), Pipeline Parallelism (PP), and Data Parallelism (DP), are employed.
We provide precise formulas to estimate the memory consumed by parameters, gradients, optimizer states, and activations for 4D parallel training (DP, TP, PP, and Context Parallelism (CP)) in the Llama architecture.
Results indicate that when the estimated memory usage is below 80% of the available GPU memory, the training never encounters out-of-memory errors.
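The 80% threshold above can be checked with a back-of-the-envelope estimator. The sketch below is my own simplification, not the paper's precise formulas: it covers only parameter, gradient, and optimizer memory (ignoring activations, which the paper's formulas include and which vary with sequence length and the CP degree), and the 18 bytes/parameter figure assumes fp16 parameters and gradients plus fp32 Adam states.

```python
# Rough per-GPU memory check for 4D-parallel training (DP, TP, PP, CP).
# Parameter-related states are split across the TP * PP grid; each DP
# replica keeps a full copy of its shard.

def estimate_param_grad_bytes(n_params: int, tp: int, pp: int,
                              bytes_per_param: int = 18) -> float:
    """fp16 params (2) + fp16 grads (2) + fp32 Adam states (12) + fp32
    master weights (trimmed here to 18 bytes/param total, an assumption),
    divided by the tensor- and pipeline-parallel degrees."""
    return n_params * bytes_per_param / (tp * pp)

def fits(n_params: int, tp: int, pp: int, gpu_bytes: int,
         threshold: float = 0.8) -> bool:
    # Heuristic mirroring the paper's finding: stay below 80% of GPU memory.
    return estimate_param_grad_bytes(n_params, tp, pp) < threshold * gpu_bytes

GPU = 80 * 2**30  # an 80 GiB accelerator
print(fits(8_000_000_000, tp=1, pp=1, gpu_bytes=GPU))  # 8B dense, one GPU: False
print(fits(8_000_000_000, tp=4, pp=2, gpu_bytes=GPU))  # split 8 ways: True
```

Even this crude version captures the qualitative point: whether a configuration fits is predictable from the model size and the parallelism degrees before launching a job.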
arXiv Detail & Related papers (2024-11-10T13:45:08Z) - UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory [69.33445217944029]
PETL is an effective strategy for adapting pre-trained models to downstream domains.
Recent PETL works focus on the more valuable memory-efficient characteristic.
We propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT)
arXiv Detail & Related papers (2023-08-28T05:38:43Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z) - DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution [15.086401550425125]
DistIR is a representation for distributed computation that is tailored for efficient analyses.
We show how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations.
arXiv Detail & Related papers (2021-11-09T21:32:51Z) - Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
System supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z) - Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.