Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems
- URL: http://arxiv.org/abs/2506.17551v2
- Date: Tue, 24 Jun 2025 02:28:50 GMT
- Title: Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems
- Authors: Haowei Yang, Yu Tian, Zhongheng Yang, Zhao Wang, Chengrui Zhou, Dannier Li
- Abstract summary: Large language models (LLMs) in recommendation systems have become increasingly prominent. This paper systematically investigates two classes of optimization methods: model parallelism and data parallelism. Experiments conducted on a real-world recommendation dataset in a simulated service environment demonstrate that our proposed hybrid parallelism scheme increases training throughput by over 30%.
- Score: 6.453224262551299
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid adoption of large language models (LLMs) in recommendation systems, the computational and communication bottlenecks caused by their massive parameter sizes and large data volumes have become increasingly prominent. This paper systematically investigates two classes of optimization methods, model parallelism and data parallelism, for distributed training of LLMs in recommendation scenarios. For model parallelism, we implement both tensor parallelism and pipeline parallelism, and introduce an adaptive load-balancing mechanism to reduce cross-device communication overhead. For data parallelism, we compare synchronous and asynchronous modes, combining gradient compression and sparsification techniques with an efficient aggregation communication framework to significantly improve bandwidth utilization. Experiments conducted on a real-world recommendation dataset in a simulated service environment demonstrate that our proposed hybrid parallelism scheme increases training throughput by over 30% and improves resource utilization by approximately 20% compared to traditional single-mode parallelism, while maintaining strong scalability and robustness. Finally, we discuss trade-offs among different parallel strategies in online deployment and outline future directions involving heterogeneous hardware integration and automated scheduling technologies.
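The paper does not publish code here, so the sketch below illustrates one standard way to realize the "gradient compression and sparsification" step mentioned in the abstract: per-parameter top-k selection with an error-feedback residual, applied before the compressed gradient would be exchanged between data-parallel workers. The class name, compression ratio, and error-feedback scheme are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: generic top-k gradient sparsification with error
# feedback, one common realization of the compression idea described above.
import math
import torch

class TopKCompressor:
    """Keep only the k largest-magnitude gradient entries; accumulate the
    discarded entries as a residual so no gradient mass is permanently lost."""

    def __init__(self, compress_ratio: float = 0.01):
        self.compress_ratio = compress_ratio
        self.residuals = {}  # per-parameter error-feedback buffers

    def compress(self, name: str, grad: torch.Tensor):
        flat = grad.flatten()
        # Error feedback: add back what was not transmitted last step.
        if name in self.residuals:
            flat = flat + self.residuals[name]
        k = max(1, int(flat.numel() * self.compress_ratio))
        _, indices = torch.topk(flat.abs(), k)
        values = flat[indices]
        # Everything outside the top-k becomes the new residual.
        residual = flat.clone()
        residual[indices] = 0.0
        self.residuals[name] = residual
        return values, indices, grad.shape

    @staticmethod
    def decompress(values, indices, shape):
        flat = torch.zeros(math.prod(shape), device=values.device)
        flat[indices] = values
        return flat.view(shape)

# Toy usage: compress a dense gradient, then rebuild the sparse tensor that
# would be communicated (e.g. all-gathered) across data-parallel workers.
if __name__ == "__main__":
    comp = TopKCompressor(compress_ratio=0.1)
    grad = torch.randn(4, 8)
    vals, idx, shape = comp.compress("layer1.weight", grad)
    sparse_grad = comp.decompress(vals, idx, shape)
    print("kept", vals.numel(), "of", grad.numel(), "entries")
```

In a real data-parallel setup only the values and indices would be communicated, and the residual buffers re-inject the discarded gradient mass in later steps rather than dropping it.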
Related papers
- DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism [14.539699026008746]
Dynamic Hybrid Parallelism (DHP) is an efficient strategy that adaptively reconfigures communication groups and parallelism during MLLM training.
DHP significantly outperforms Megatron-LM and DeepSpeed, achieving up to 1.36× speedup in training throughput.
arXiv Detail & Related papers (2026-02-25T11:11:53Z)
- Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide [15.92814573525633]
This paper offers a comprehensive review of collective operations and distributed parallel strategies.
We examine hybrid parallelization designs, emphasizing communication overlap across different stages of model deployment.
We highlight open challenges and limitations of current LLM training paradigms and outline promising directions for the next generation of large scale model development.
arXiv Detail & Related papers (2026-02-09T19:01:13Z)
- AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism [54.8494905524997]
We introduce asynchronous updates across both parallelism axes, relaxing the co-location requirement.
We provide convergence guarantees for both sparse averaging and asynchronous updates.
Experiments on large-scale language models demonstrate that our approach matches the performance of the fully synchronous baseline.
arXiv Detail & Related papers (2026-01-30T01:24:47Z)
- Training Report of TeleChat3-MoE [77.94641922160359]
This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes.
We detail systematic methodologies for operator-level and end-to-end numerical accuracy verification, ensuring consistency across hardware platforms.
A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations.
arXiv Detail & Related papers (2025-12-30T11:42:14Z)
- Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training [9.47829333855806]
In deep learning recommendation models (DLRM), the sparse embedding table is a crucial component for managing sparse categorical features.
We propose a novel two-dimensional sparse parallelism approach to overcome scalability challenges.
We show that the proposed approach significantly enhances training efficiency while maintaining model performance parity.
arXiv Detail & Related papers (2025-08-05T19:12:18Z)
- Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization [8.918295350787465]
Current automatic parallel planning frameworks overlook the simultaneous consideration of node heterogeneity and dynamic network topology changes.
We introduce a strategy pruning technique to rapidly discard infeasible parallel configurations.
Preliminary evaluations confirm that our method notably enhances training performance on heterogeneous nodes.
arXiv Detail & Related papers (2025-06-03T12:14:17Z)
- Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism [59.79227116582264]
Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging.
We propose a novel compression algorithm that compresses both forward and backward passes, enabling up to 99% compression with no convergence degradation.
arXiv Detail & Related papers (2025-06-02T02:19:22Z)
- Improving Automatic Parallel Training via Balanced Memory Workload Optimization [36.87527680184956]
Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains.
We present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy.
Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints.
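As a rough illustration of what such automatic strategy selection involves (not Galvatron-BMW's actual planner), the sketch below brute-forces (data, tensor, pipeline) parallel degrees under a per-GPU memory cap using a made-up cost model; every constant and formula here is an assumption for demonstration only.

```python
# Toy planner: enumerate hybrid parallelism configurations for 8 GPUs and
# keep the fastest one that fits in memory. Cost model is purely illustrative.
from itertools import product

NUM_GPUS = 8
GPU_MEMORY_GB = 24.0
MODEL_GB = 40.0          # parameters + optimizer state (assumed)
ACTIVATION_GB = 16.0     # activation footprint at full batch (assumed)

def estimate(dp, tp, pp):
    """Return (per-GPU memory in GB, relative step time) for one config."""
    mem = MODEL_GB / (tp * pp) + ACTIVATION_GB / (dp * tp)
    # Crude time model: compute shrinks with total parallelism; communication
    # grows with tensor-parallel width, replica count, and pipeline depth.
    time = 1.0 / (dp * tp * pp) + 0.02 * tp + 0.01 * dp + 0.03 * (pp - 1)
    return mem, time

best = None
for dp, tp, pp in product([1, 2, 4, 8], repeat=3):
    if dp * tp * pp != NUM_GPUS:
        continue
    mem, time = estimate(dp, tp, pp)
    if mem > GPU_MEMORY_GB:
        continue                      # prune configs that do not fit
    if best is None or time < best[0]:
        best = (time, dp, tp, pp, mem)

print("best (dp, tp, pp):", best[1:4], "est. memory GB:", round(best[4], 1))
```

Real planners replace the toy cost model with profiled or analytical estimates and search a much larger space, but the feasibility-prune-then-minimize structure is the same.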
arXiv Detail & Related papers (2023-07-05T05:28:38Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- On Optimizing the Communication of Model Parallelism [74.15423270435949]
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL).
In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh.
We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
arXiv Detail & Related papers (2022-11-10T03:56:48Z)
- Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
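As a minimal illustration of the layer-local idea (not this paper's code), the sketch below trains each block against its own auxiliary head and detaches activations between blocks, so no gradient crosses block boundaries. Block sizes, the auxiliary classifier, and the optimizer settings are assumptions; in the actual framework the blocks would run concurrently on separate devices rather than sequentially as here.

```python
# Minimal sketch of truncated layer-wise backpropagation: each block has a
# local objective, and activations are detached before being passed on.
import torch
import torch.nn as nn

class LocalBlock(nn.Module):
    def __init__(self, in_dim, out_dim, num_classes):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        self.head = nn.Linear(out_dim, num_classes)  # auxiliary local classifier

    def forward(self, x):
        h = self.body(x)
        return h, self.head(h)

blocks = [LocalBlock(32, 64, 10), LocalBlock(64, 64, 10)]
optims = [torch.optim.SGD(b.parameters(), lr=0.1) for b in blocks]
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 32)                 # toy batch
y = torch.randint(0, 10, (16,))

for block, opt in zip(blocks, optims):
    h, logits = block(x)
    loss = loss_fn(logits, y)           # local objective for this block only
    opt.zero_grad()
    loss.backward()                     # gradients stay inside the block
    opt.step()
    x = h.detach()                      # truncate: no backprop into earlier blocks
```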
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
- Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference [15.720414948573753]
We consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers).
We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model.
We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation.
arXiv Detail & Related papers (2020-08-19T06:44:41Z)
- Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)