Related papers: DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism

URL: http://arxiv.org/abs/2602.21788v1
Date: Wed, 25 Feb 2026 11:11:53 GMT
Title: DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism
Authors: Yifan Niu, Han Xiao, Dongyi Liu, Wei Zhou, Jia Li
Abstract summary: Dynamic Hybrid Parallelism (DHP) is an efficient strategy that adaptively reconfigures communication groups and parallelism during MLLM training. DHP significantly outperforms Megatron-LM and DeepSpeed, achieving up to 1.36 $\times$ speedup in training throughput.
Score: 14.539699026008746
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling long-context capabilities is crucial for Multimodal Large Language Models (MLLMs). However, real-world multimodal datasets are extremely heterogeneous. Existing training frameworks predominantly rely on static parallelism strategies, which suffer from severe load imbalance, redundant communication, and suboptimal hardware utilization under data heterogeneity. In this work, we propose Dynamic Hybrid Parallelism (DHP), an efficient parallelism strategy that adaptively reconfigures communication groups and parallelism degrees during MLLM training. We generalize the non-power-of-two parallelism degrees and develop a polynomial-time algorithm to generate near-optimal parallelism strategies with only millisecond-level overhead per training batch. DHP is able to maintain high hardware efficiency even under extreme data variability. Experimental results demonstrate that DHP significantly outperforms Megatron-LM and DeepSpeed, achieving up to 1.36 $\times$ speedup in training throughput while maintaining near-linear scaling efficiency across large-scale NPU clusters.

Related papers

HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models [50.31704374968706]
Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multi-granularity levels.
arXiv Detail & Related papers (2025-10-23T08:16:44Z)
CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z)
Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training [9.47829333855806]
In deep learning recommendation models (DLRM), the sparse embedding table is a crucial component for managing sparse categorical features. We propose a novel two-dimensional sparse parallelism approach to overcome scalability challenges. We show that the proposed approach significantly enhances training efficiency while maintaining model performance parity.
arXiv Detail & Related papers (2025-08-05T19:12:18Z)
Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems [6.453224262551299]
Large language models (LLMs) in recommendation systems have become increasingly prominent. This paper systematically investigates two classes of optimization methods: model parallelism and data parallelism. Experiments conducted on a real-world recommendation dataset in a simulated service environment demonstrate that our proposed hybrid parallelism scheme increases training throughput by over 30%.
arXiv Detail & Related papers (2025-06-21T02:37:25Z)
Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM [49.2709992932292]
Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue, but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies.
arXiv Detail & Related papers (2025-03-10T10:52:50Z)
In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations. As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks. This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
SplitBrain: Hybrid Data and Model Parallel Deep Learning [11.63431725146897]
This paper presents SplitBrain, a high performance distributed deep learning framework supporting hybrid data and model parallelism. Specifically, SplitBrain provides layer-specific partitioning that co-locates compute intensive convolutional layers while sharding memory demanding layers. Results show that SplitBrain can achieve nearly linear speedup while saving up to 67% of memory consumption for data and model parallel VGG over CIFAR-10.
arXiv Detail & Related papers (2021-12-31T06:25:38Z)
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models. The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z)
Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation. We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods. We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods. Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)