Related papers: COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

URL: http://arxiv.org/abs/2211.16648v2
Date: Thu, 14 Mar 2024 15:06:02 GMT
Title: COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training
Authors: Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexandros Daglis
Abstract summary: Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging task. We introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training.
Score: 42.514897110537596
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging task requiring careful balance of compute, memory, and network resources. Moreover, a plethora of each model's tuning knobs drastically affect the performance, with optimal values often depending on the underlying cluster's characteristics, which necessitates a complex cluster-workload co-design process. To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training. We develop a step-by-step process to establish a reusable and flexible methodology, and demonstrate its application with case studies of training large models on cluster configurations of variable compute, memory, and network resources. Our case studies demonstrate COMET's utility in identifying promising architectural optimization directions and guiding system designers in configuring key model and cluster parameters. To illustrate, cluster configuration comparisons identify performance differences of up to 7.7x and highlight performance optimization opportunities of up to 1.4x when employing memory expansion as an optimization technique.

Related papers

High-Throughput LLM inference on Heterogeneous Clusters [6.11367906161332]
Large language model (LLM) inference on heterogeneous clusters presents two main challenges. A novel mechanism is proposed to schedule requests among instances, which fully considers the different processing capabilities of various instances. Extensive experiments show that the proposed scheduler improves throughput by 122.5% and 33.6% on two heterogeneous clusters.
arXiv Detail & Related papers (2025-04-18T08:59:11Z)
Towards Learnable Anchor for Deep Multi-View Clustering [49.767879678193005]
In this paper, we propose the Deep Multi-view Anchor Clustering (DMAC) model that performs clustering in linear time. With the optimal anchors, the full sample graph is calculated to derive a discriminative embedding for clustering. Experiments on several datasets demonstrate superior performance and efficiency of DMAC compared to state-of-the-art competitors.
arXiv Detail & Related papers (2025-03-16T09:38:11Z)
LCFed: An Efficient Clustered Federated Learning Framework for Heterogeneous Data [21.341280782748278]
Clustered federated learning (CFL) addresses the performance challenges posed by data heterogeneity in federated learning (FL). Existing CFL approaches strictly limit knowledge sharing to within clusters, lacking the integration of global knowledge with intra-cluster training. We propose LCFed, an efficient CFL framework to combat these challenges.
arXiv Detail & Related papers (2025-01-03T14:59:48Z)
Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction [52.09472099976885]
IAR is an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models. Our method consistently enhances the model training efficiency and performance from 100M to 1.4B, reducing the training time by half while achieving the same FID.
arXiv Detail & Related papers (2025-01-01T15:58:51Z)
Graph Cut-guided Maximal Coding Rate Reduction for Learning Image Embedding and Clustering [2.4503870408262354]
We propose a unified framework, termed graph Cut-guided Maximal Coding Rate Reduction (CgMCR), for jointly learning the structured embeddings and the clustering. We conduct extensive experiments on both standard and out-of-domain image datasets and experimental results validate the effectiveness of our approach.
arXiv Detail & Related papers (2024-12-25T15:20:54Z)
A Survey on Inference Optimization Techniques for Mixture of Experts Models [50.40325411764262]
Large-scale Mixture of Experts (MoE) models offer enhanced model capacity and computational efficiency through conditional computation. Deploying and running inference on these models presents significant challenges in computational resources, latency, and energy efficiency. This survey analyzes optimization techniques for MoE models across the entire system stack.
arXiv Detail & Related papers (2024-12-18T14:11:15Z)
Towards Automated Model Design on Recommender Systems [21.421326082345136]
We introduce a novel paradigm that utilizes weight sharing to explore abundant solution spaces. From a co-design perspective, we achieve 2x FLOPs efficiency, 1.8x energy efficiency, and 1.5x performance improvements in recommender models.
arXiv Detail & Related papers (2024-11-12T06:03:47Z)
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models. Our approach employs activation sparsity to extract experts. Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
RGM: A Robust Generalizable Matching Model [49.60975442871967]
We propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching). To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth. We are able to mix up various dense and sparse matching datasets, significantly improving the training diversity.
arXiv Detail & Related papers (2023-10-18T07:30:08Z)
A Generalized Framework for Predictive Clustering and Optimization [18.06697544912383]
Clustering is a powerful and extensively used data science tool. In this article, we define a generalized optimization framework for predictive clustering. We also present a joint optimization strategy that exploits mixed-integer linear programming (MILP) for global optimization.
arXiv Detail & Related papers (2023-05-07T19:56:51Z)
Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO). MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts. Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
Dynamic Clustering and Cluster Contrastive Learning for Unsupervised Person Re-identification [29.167783500369442]
Unsupervised Re-ID methods aim at learning robust and discriminative features from unlabeled data. We propose a dynamic clustering and cluster contrastive learning (DCCC) method. Experiments on several widely used public datasets validate the effectiveness of our proposed DCCC.
arXiv Detail & Related papers (2023-03-13T01:56:53Z)
Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms. We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in a SSL setting. The co-evolution during pre-training of both dense and gated encoder offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z)
AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness [10.20441432750275]
We develop AMP, a framework that automatically derives model-parallel execution strategies. We evaluate AMP on popular models and cluster setups from public clouds. AMP finds strategies with 1.54x and 1.77x higher throughput than state-of-the-art model-parallel systems.
arXiv Detail & Related papers (2022-10-13T18:55:28Z)
Deep Attention-guided Graph Clustering with Dual Self-supervision [49.040136530379094]
We propose a novel method, namely deep attention-guided graph clustering with dual self-supervision (DAGC). We develop a dual self-supervision solution consisting of a soft self-supervision strategy with a triplet Kullback-Leibler divergence loss and a hard self-supervision strategy with a pseudo supervision loss. Our method consistently outperforms state-of-the-art methods on six benchmark datasets.
arXiv Detail & Related papers (2021-11-10T06:53:03Z)
Distributed Training of Deep Learning Models: A Taxonomic Perspective [11.924058430461216]
Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster. We aim to shine some light on the fundamental principles that are at work when training deep neural networks in a cluster of independent machines.
arXiv Detail & Related papers (2020-07-08T08:56:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.