Venn: Resource Management for Collaborative Learning Jobs
- URL: http://arxiv.org/abs/2312.08298v2
- Date: Wed, 30 Apr 2025 02:21:01 GMT
- Title: Venn: Resource Management for Collaborative Learning Jobs
- Authors: Jiachen Liu, Fan Lai, Ding Ding, Yiwen Zhang, Mosharaf Chowdhury
- Abstract summary: Collaborative learning (CL) has emerged as a promising approach for machine learning (ML) and data science across distributed edge devices. In this paper, we present Venn, a CL resource manager that efficiently schedules heterogeneous devices among multiple CL jobs. Our evaluation shows that, compared to the state-of-the-art CL resource managers, Venn improves the average JCT by up to 1.88x.
- Score: 24.596584073531886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, collaborative learning (CL) has emerged as a promising approach for machine learning (ML) and data science across distributed edge devices. As the deployment of CL jobs increases, they inevitably contend for limited resources. However, efficient resource scheduling in this context is challenging because of the ephemeral nature and resource heterogeneity of devices, coupled with the overlapping resource requirements of diverse CL jobs. Existing resource managers often assign devices to CL jobs randomly for simplicity and scalability, but this approach compromises job efficiency. In this paper, we present Venn, a CL resource manager that efficiently schedules ephemeral, heterogeneous devices among multiple CL jobs to reduce the average job completion time (JCT). Venn formulates the Intersection Resource Scheduling (IRS) problem to identify complex resource contention among multiple CL jobs. It then proposes a contention-aware scheduling heuristic to minimize the average scheduling delay. Furthermore, it proposes a resource-aware device-to-job matching heuristic to optimize response collection time by mitigating stragglers. Our evaluation shows that, compared to the state-of-the-art CL resource managers, Venn improves the average JCT by up to 1.88x. The code is available at https://github.com/SymbioticLab/Venn.
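The abstract describes Venn's two heuristics only at a high level. As a rough, hedged illustration of contention-aware scheduling, the toy sketch below assigns each arriving device to an eligible job, preferring jobs with the narrowest device eligibility (scarce matches first) and breaking ties by smallest remaining demand, a shortest-job-first flavor. The `Job` class, its `demand`/`eligible` fields, and the priority rule are illustrative assumptions, not Venn's actual algorithm; see the linked repository for the real implementation.

```python
# Hedged toy sketch of contention-aware device-to-job scheduling in the
# spirit of Venn. NOT the authors' algorithm: the Job fields and the
# priority rule below are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    demand: int                                   # devices still needed
    eligible: set = field(default_factory=set)    # device types it accepts

def assign(device_type: str, jobs: list[Job]) -> Job | None:
    """Give one arriving device to an eligible job.

    Rule of thumb: prefer the job with the narrowest eligibility (fewest
    alternatives), then the smallest remaining demand, which tends to
    lower average scheduling delay across contending jobs.
    """
    candidates = [j for j in jobs if device_type in j.eligible and j.demand > 0]
    if not candidates:
        return None
    winner = min(candidates, key=lambda j: (len(j.eligible), j.demand))
    winner.demand -= 1
    return winner

jobs = [Job("cl-a", demand=2, eligible={"phone"}),
        Job("cl-b", demand=5, eligible={"phone", "laptop"})]
for dev in ["phone", "laptop", "phone"]:
    j = assign(dev, jobs)
    print(dev, "->", j.name if j else "idle")
```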
Related papers
- Topology-aware Preemptive Scheduling for Co-located LLM Workloads [7.240168647854797]
We develop a fine-grained, topology-aware method for scheduling hybrid workloads.
This method significantly increases preemption efficiency and improves overall scheduling performance for LLM workloads by 55%.
arXiv Detail & Related papers (2024-11-18T13:26:09Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, retaining up to 85% of its performance while delivering over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
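The FactorLLM summary names FFN decomposition into sparse sub-networks but not the mechanics. Below is a minimal sketch assuming, purely for illustration, that a trained FFN's weight matrices are partitioned into contiguous expert slices and a small post-hoc router picks one expert per token; `SlicedFFN`, the equal slicing, top-1 routing, and the omitted biases are all assumptions, not the paper's exact method.

```python
# Hedged sketch: carving a trained dense FFN into expert slices with a tiny
# router, in the spirit of FactorLLM. Equal slicing and top-1 routing are
# illustrative assumptions; biases are omitted for brevity.
import torch
import torch.nn as nn

class SlicedFFN(nn.Module):
    def __init__(self, w1: nn.Linear, w2: nn.Linear, n_experts: int = 4):
        super().__init__()
        d_model, d_ff = w1.in_features, w1.out_features
        step = d_ff // n_experts
        # Each "expert" reuses a contiguous slice of the dense FFN's weights.
        self.up = nn.ParameterList(
            nn.Parameter(w1.weight.detach()[i * step:(i + 1) * step].clone())
            for i in range(n_experts))
        self.down = nn.ParameterList(
            nn.Parameter(w2.weight.detach()[:, i * step:(i + 1) * step].clone())
            for i in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)   # trained after slicing

    def forward(self, x):                 # x: (tokens, d_model)
        expert = self.router(x).argmax(-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for i in range(len(self.up)):
            mask = expert == i
            if mask.any():
                h = torch.relu(x[mask] @ self.up[i].T)
                out[mask] = h @ self.down[i].T
        return out

moe = SlicedFFN(nn.Linear(64, 256), nn.Linear(256, 64), n_experts=4)
print(moe(torch.randn(8, 64)).shape)      # torch.Size([8, 64])
```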
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have demonstrated impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- FedLPS: Heterogeneous Federated Learning for Multiple Tasks with Local Parameter Sharing [14.938531944702193]
We propose Federated Learning with Local Parameter Sharing (FedLPS).
FedLPS uses transfer learning to facilitate the deployment of multiple tasks on a single device, dividing the local model into a shareable encoder and task-specific predictors.
FedLPS significantly outperforms the state-of-the-art (SOTA) FL frameworks by up to 4.88% and reduces the computational resource consumption by 21.3%.
arXiv Detail & Related papers (2024-02-13T16:30:30Z)
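FedLPS's encoder/predictor split lends itself to a compact illustration. The sketch below keeps one shareable encoder per device and a lightweight head per task; `MultiTaskLocalModel`, the layer sizes, and the choice to freeze the encoder are hypothetical assumptions, not FedLPS's prescribed design.

```python
# Hedged sketch of FedLPS-style local parameter sharing: several tasks on
# one device share an encoder and keep per-task predictor heads.
import torch.nn as nn

class MultiTaskLocalModel(nn.Module):
    def __init__(self, tasks: dict[str, int], d_in: int = 784, d_hid: int = 128):
        super().__init__()
        # Shareable encoder: one copy serves every task on the device.
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU())
        # Task-specific predictors: one lightweight head per task.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(d_hid, n_cls) for name, n_cls in tasks.items()})

    def forward(self, x, task: str):
        return self.heads[task](self.encoder(x))

model = MultiTaskLocalModel({"digits": 10, "fashion": 10})
# Transfer-learning flavor: reuse/freeze the shared encoder, train only heads.
for p in model.encoder.parameters():
    p.requires_grad = False
```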
- RecDCL: Dual Contrastive Learning for Recommendation [65.6236784430981]
We propose RecDCL, a dual contrastive learning recommendation framework.
In RecDCL, the FCL objective is designed to eliminate redundant solutions on user-item positive pairs.
The BCL objective generates contrastive embeddings on output vectors to enhance the robustness of the representations.
arXiv Detail & Related papers (2024-01-28T11:51:09Z)
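The FCL objective is only named above. As a hedged sketch, the function below penalizes feature-wise redundancy between embeddings of user-item positive pairs with a Barlow-Twins-style cross-correlation loss; the normalization and the off-diagonal weight `lambd` are illustrative assumptions, not RecDCL's exact formulation.

```python
# Hedged sketch of a feature-wise contrastive (FCL-style) objective on
# user-item positive pairs. Not RecDCL's exact loss.
import torch

def feature_wise_cl(user_emb, item_emb, lambd: float = 5e-3):
    # user_emb, item_emb: (batch, dim) embeddings of positive user-item pairs
    u = (user_emb - user_emb.mean(0)) / (user_emb.std(0) + 1e-8)
    v = (item_emb - item_emb.mean(0)) / (item_emb.std(0) + 1e-8)
    c = (u.T @ v) / u.size(0)                        # (dim, dim) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()   # align matched features
    off_diag = (c - torch.diag_embed(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambd * off_diag                # decorrelate the rest

loss = feature_wise_cl(torch.randn(128, 64), torch.randn(128, 64))
```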
- Client Orchestration and Cost-Efficient Joint Optimization for NOMA-Enabled Hierarchical Federated Learning [55.49099125128281]
We propose a non-orthogonal multiple access (NOMA) enabled HFL system under semi-synchronous cloud model aggregation.
We show that the proposed scheme outperforms the considered benchmarks regarding HFL performance improvement and total cost reduction.
arXiv Detail & Related papers (2023-11-03T13:34:44Z)
- FLrce: Resource-Efficient Federated Learning with Early-Stopping Strategy [7.963276533979389]
Federated Learning (FL) has achieved great popularity in the Internet of Things (IoT).
We present FLrce, an efficient FL framework with a relationship-based client selection and early-stopping strategy.
Experiment results show that, compared with existing efficient FL frameworks, FLrce improves the computation and communication efficiency by at least 30% and 43% respectively.
arXiv Detail & Related papers (2023-10-15T10:13:44Z)
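The early-stopping side of FLrce can be pictured with a generic patience rule, sketched below under the assumption that training halts once validation accuracy stalls; `run_fl` and its `patience` criterion are stand-ins, and FLrce's actual relationship-based selection and stopping rule are more involved.

```python
# Hedged sketch of early stopping in an FL training loop, in the spirit of
# FLrce. The patience-on-accuracy criterion is an illustrative assumption.
def run_fl(rounds: int, train_round, evaluate, patience: int = 5):
    best_acc, stale = 0.0, 0
    for r in range(rounds):
        train_round(r)                # select clients, train locally, aggregate
        acc = evaluate()
        if acc > best_acc:
            best_acc, stale = acc, 0
        else:
            stale += 1
        if stale >= patience:         # stop early: saves compute and traffic
            print(f"early stop at round {r} (best acc {best_acc:.3f})")
            break
    return best_acc

# Toy run with a plateauing accuracy curve (illustrative only).
accs = iter([0.50, 0.60, 0.65] + [0.65] * 50)
run_fl(20, train_round=lambda r: None, evaluate=lambda: next(accs), patience=3)
```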
- Bridging the Gap Between Foundation Models and Heterogeneous Federated Learning [9.198799314774437]
Federated learning (FL) offers privacy-preserving decentralized machine learning, optimizing models at edge clients without sharing private data.
Foundation models (FMs) have gained traction in the artificial intelligence (AI) community due to their exceptional performance across various tasks. However, their scale makes them difficult to train and deploy on resource-constrained, heterogeneous edge clients.
We present an adaptive framework for Resource-aware Federated Foundation Models (RaFFM) to address these challenges.
arXiv Detail & Related papers (2023-09-30T04:31:53Z)
- Joint Age-based Client Selection and Resource Allocation for Communication-Efficient Federated Learning over NOMA Networks [8.030674576024952]
In federated learning (FL), distributed clients can collaboratively train a shared global model while retaining their own training data locally.
In this paper, a joint optimization problem of client selection and resource allocation is formulated, aiming to minimize the total time consumption of each round in FL over a non-orthogonal multiple access (NOMA) enabled wireless network.
In addition, a server-side artificial neural network (ANN) is proposed to predict the FL models of clients who are not selected at each round to further improve FL performance.
arXiv Detail & Related papers (2023-04-18T13:58:16Z)
- Automated Federated Learning in Mobile Edge Networks -- Fast Adaptation and Convergence [83.58839320635956]
Federated Learning (FL) can be used in mobile edge networks to train machine learning models in a distributed manner.
Recent FL has been interpreted within a Model-Agnostic Meta-Learning (MAML) framework, which brings FL significant advantages in fast adaptation and convergence over heterogeneous datasets.
This paper addresses how much benefit MAML brings to FL and how to maximize such benefit over mobile edge networks.
arXiv Detail & Related papers (2023-03-23T02:42:10Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Scheduling and Aggregation Design for Asynchronous Federated Learning over Wireless Networks [56.91063444859008]
Federated Learning (FL) is a collaborative machine learning framework that combines on-device training and server-based aggregation.
We propose an asynchronous FL design with periodic aggregation to tackle the straggler issue in FL systems.
We show that an "age-aware" aggregation weighting design can significantly improve the learning performance in an asynchronous FL setting.
arXiv Detail & Related papers (2022-12-14T17:33:01Z)
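A hedged sketch of what "age-aware" weighting might look like: client updates are discounted by their staleness before aggregation. The exponential form, the `decay` rate, and the plain-list weight vectors are illustrative assumptions, not the paper's derived design.

```python
# Hedged sketch of age-aware weighting for asynchronous FL aggregation:
# staler client updates receive smaller aggregation weights.
import math

def aggregate(global_w, updates, current_round: int, decay: float = 0.5):
    """updates: list of (client_weight_vector, round_the_update_was_computed)."""
    num = [0.0] * len(global_w)
    den = 0.0
    for w, born in updates:
        age = current_round - born          # staleness of this update
        a = math.exp(-decay * age)          # older updates count for less
        den += a
        num = [n + a * wi for n, wi in zip(num, w)]
    return [n / den for n in num] if den else global_w

# Fresh update (age 1) dominates the stale one (age 5).
print(aggregate([0.0, 0.0], [([1.0, 2.0], 9), ([3.0, 4.0], 5)], current_round=10))
```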
- Multi-Job Intelligent Scheduling with Cross-Device Federated Learning [65.69079337653994]
Federated Learning (FL) enables collaborative global machine learning model training without sharing sensitive raw data.
We propose a novel multi-job FL framework that enables multiple jobs to be trained in parallel.
We also propose an intelligent scheduling approach that combines multiple scheduling methods, including an original reinforcement learning-based method and an original Bayesian optimization-based method.
arXiv Detail & Related papers (2022-11-24T06:17:40Z)
- SlimFL: Federated Learning with Superposition Coding over Slimmable Neural Networks [56.68149211499535]
Federated learning (FL) is a key enabler for efficient communication and computing, leveraging devices' distributed computing capabilities.
This paper proposes a novel learning framework that integrates FL with width-adjustable slimmable neural networks (SNNs).
We propose a communication and energy-efficient SNN-based FL (named SlimFL) that jointly utilizes superposition coding (SC) for global model aggregation and superposition training (ST) for updating local models.
arXiv Detail & Related papers (2022-03-26T15:06:13Z)
- How Does Cell-Free Massive MIMO Support Multiple Federated Learning Groups? [42.63398054091038]
We propose a cell-free massive multiple-input multiple-output (MIMO) network to guarantee the stable operation of multiple FL processes.
We then develop a novel scheme that asynchronously executes the iterations of FL processes under multicasting downlink and conventional uplink transmission protocols.
arXiv Detail & Related papers (2021-07-20T15:46:53Z)
- Overcoming Catastrophic Forgetting with Gaussian Mixture Replay [79.0660895390689]
We present Gaussian Mixture Replay (GMR), a rehearsal-based approach to continual learning (CL) built on Gaussian Mixture Models (GMMs).
We mitigate catastrophic forgetting (CF) by generating samples from previous tasks and merging them with current training data.
We evaluate GMR on multiple image datasets, which are divided into class-disjoint sub-tasks.
arXiv Detail & Related papers (2021-04-19T11:41:34Z)
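The replay mechanism lends itself to a short sketch: fit a GMM on a finished task's data, then sample synthetic rehearsal points and mix them into the next task's training set. Fitting in raw input space with a fixed 10-component GMM and the random stand-in data are simplifying assumptions relative to the paper.

```python
# Hedged sketch of Gaussian Mixture Replay: sample "memories" of a finished
# task from a fitted GMM and blend them into the next task's training data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
old_task = rng.normal(0.0, 1.0, size=(1000, 16))   # stand-in for task-1 data
gmm = GaussianMixture(n_components=10, random_state=0).fit(old_task)

new_task = rng.normal(3.0, 1.0, size=(1000, 16))   # stand-in for task-2 data
replay, _ = gmm.sample(256)                        # generated rehearsal points
mixed = np.concatenate([new_task[:256], replay])   # train on old + new together
print(mixed.shape)                                 # (512, 16)
```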
- Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters [7.206919625027208]
We present Rosella, a new self-driving, distributed approach to task scheduling in heterogeneous clusters.
Rosella automatically learns the compute environment and adjusts its scheduling policy in real-time.
We evaluate Rosella with a variety of workloads on a 32-node AWS cluster.
arXiv Detail & Related papers (2020-10-28T20:12:29Z)
- Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors.
Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)
- Delay Minimization for Federated Learning Over Wireless Communication Networks [172.42768672943365]
The problem of delay minimization for federated learning (FL) over wireless communication networks is investigated.
A bisection search algorithm is proposed to obtain the optimal solution.
Simulation results show that the proposed algorithm can reduce delay by up to 27.3% compared to conventional FL methods.
arXiv Detail & Related papers (2020-07-05T19:00:07Z)
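Bisection over a monotone feasibility condition is a standard routine; a hedged sketch follows. The `feasible` oracle is a stand-in, since the paper's actual sub-problems, derived from its wireless FL model, are not reproduced here.

```python
# Hedged sketch of bisection search for the smallest feasible value of a
# scalar knob (e.g., a per-round deadline). The toy oracle is illustrative.
def bisect_min_feasible(lo: float, hi: float, feasible, tol: float = 1e-6):
    """Assumes feasible(x) is monotone: False below a threshold, True above."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid      # mid works: the optimum is at or below mid
        else:
            lo = mid      # mid fails: the optimum is above mid
    return hi

# Toy oracle: a deadline is feasible once it exceeds 0.37 (illustrative only).
print(bisect_min_feasible(0.0, 1.0, lambda t: t >= 0.37))
```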
- Sequence-to-sequence models for workload interference [1.988145627448243]
Co-scheduling of jobs in data centers is a challenging scenario, where jobs compete for resources, leading to severe slowdowns or failed executions.
Current techniques, most of them already involving machine learning and job modeling, are based on workload behavior summarization across time.
We propose a methodology for modeling co-scheduling of jobs on data-centers, based on their behavior towards resources and execution time.
arXiv Detail & Related papers (2020-06-25T14:11:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.