Singularity: Planet-Scale, Preemptible, Elastic Scheduling of AI
Workloads
- URL: http://arxiv.org/abs/2202.07848v1
- Date: Wed, 16 Feb 2022 04:02:10 GMT
- Title: Singularity: Planet-Scale, Preemptible, Elastic Scheduling of AI
Workloads
- Authors: Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav
Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran
Ramjee, Pankaj Sharma, Atul Katiyar, Vipul Modi, Vaibhav Sharma, Abhishek
Singh, Shreshth Singhal, Kaustubh Welankar, Lu Xun, Ravi Anupindi, Karthik
Elangovan, Hasibur Rahman, Zhou Lin, Rahul Seetharaman, Cheng Xu, Eddie
Ailijiang, Suresh Krishnappa, Mark Russinovich (Microsoft)
- Abstract summary: We present Singularity, Microsoft's globally distributed scheduling service for deep learning training and inference workloads.
At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads.
We show that the resulting efficiency and reliability gains with Singularity are achieved with negligible impact on the steady-state performance.
- Score: 12.117736592836506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lowering costs by driving high utilization across deep learning workloads is
a crucial lever for cloud providers. We present Singularity, Microsoft's
globally distributed scheduling service for highly-efficient and reliable
execution of deep learning training and inference workloads. At the heart of
Singularity is a novel, workload-aware scheduler that can transparently preempt
and elastically scale deep learning workloads to drive high utilization without
impacting their correctness or performance, across a global fleet of AI
accelerators (e.g., GPUs, FPGAs).
All jobs in Singularity are preemptable, migratable, and dynamically
resizable (elastic) by default: a live job can be dynamically and transparently
(a) preempted and migrated to a different set of nodes, cluster, data center or
a region and resumed exactly from the point where the execution was preempted,
and (b) resized (i.e., elastically scaled-up/down) on a varying set of
accelerators of a given type. Our mechanisms are transparent in that they do
not require the user to make any changes to their code or require using any
custom libraries that may limit flexibility. Additionally, our approach
significantly improves the reliability of deep learning workloads. We show that
the resulting efficiency and reliability gains with Singularity are achieved
with negligible impact on the steady-state performance. Finally, our design
approach is agnostic of DNN architectures and handles a variety of parallelism
strategies (e.g., data/pipeline/model parallelism).
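The scheduling semantics described in the abstract (checkpoint-consistent preemption, migration across clusters or regions, and elastic resizing of a live job without loss of progress) can be pictured with a minimal, self-contained sketch. The sketch below is purely illustrative: `Job`, `Scheduler`, `Checkpoint`, and every method name are assumptions made for exposition, not Singularity's actual API, and the real system achieves transparency at the infrastructure level rather than through job cooperation as modeled here.

```python
# Hypothetical sketch of the control flow the abstract describes:
# transparent preempt -> migrate -> resume, plus elastic resize.
# All identifiers below are illustrative assumptions, not Singularity's API.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Checkpoint:
    """Consistent snapshot taken at a step boundary; resume is exact."""
    step: int
    model_state: Dict[str, float]


@dataclass
class Job:
    name: str
    world_size: int                        # accelerators currently backing the job
    step: int = 0
    model_state: Dict[str, float] = field(default_factory=lambda: {"w": 0.0})

    def train_one_step(self) -> None:
        # Stand-in for one synchronized optimizer step; the logical result
        # does not depend on how many accelerators back it.
        self.model_state["w"] += 0.1
        self.step += 1

    def snapshot(self) -> Checkpoint:
        return Checkpoint(self.step, dict(self.model_state))

    def restore(self, ckpt: Checkpoint, world_size: int) -> None:
        # Resuming on a different accelerator count is transparent to the job:
        # the logical training position (step, state) is unchanged.
        self.step, self.model_state = ckpt.step, dict(ckpt.model_state)
        self.world_size = world_size


class Scheduler:
    """Toy global scheduler: every job is preemptible and resizable by default."""

    def __init__(self, regions: Dict[str, int]):
        self.free_gpus = dict(regions)     # region -> free accelerators

    def preempt_and_migrate(self, job: Job, src: str, dst: str,
                            new_world_size: int) -> None:
        ckpt = job.snapshot()              # (1) checkpoint at a step boundary
        self.free_gpus[src] += job.world_size
        assert self.free_gpus[dst] >= new_world_size, "not enough capacity"
        self.free_gpus[dst] -= new_world_size
        job.restore(ckpt, new_world_size)  # (2) resume exactly where it stopped,
                                           #     possibly scaled up/down (elastic)


if __name__ == "__main__":
    sched = Scheduler({"us-west": 8, "eu-north": 16})
    job = Job("bert-pretrain", world_size=8)
    sched.free_gpus["us-west"] -= job.world_size

    for _ in range(100):
        job.train_one_step()

    # Higher-priority work arrives in us-west: preempt, migrate to eu-north,
    # and elastically scale the job up to 16 accelerators of the same type.
    sched.preempt_and_migrate(job, src="us-west", dst="eu-north", new_world_size=16)
    assert job.step == 100                 # resumed from the exact preemption point
    for _ in range(100):
        job.train_one_step()
    print(job.step, job.world_size)        # 200 16
```

The key design point the toy mirrors is that preemption and resizing operate on a consistent checkpoint at a step boundary, so a job always resumes from exactly where it was preempted regardless of where, or on how many accelerators, it lands next.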
Related papers
- Efficient Federated Learning against Heterogeneous and Non-stationary Client Unavailability [23.466997173249034]
FedAPM includes novel structures that (i) compensate for missed computations due to unavailability with only $O(1)$ additional memory and computation with respect to standard FedAvg.
We show that FedAPM converges to a stationary point even under non-stationary availability dynamics.
arXiv Detail & Related papers (2024-09-26T00:38:18Z)
- Context-Aware Orchestration of Energy-Efficient Gossip Learning Schemes [8.382766344930157]
We present a distributed training approach based on the combination of Gossip Learning with adaptive optimization of the learning process.
We propose a data-driven approach to OGL management that relies on optimizing the learning process in real time for each node.
Results suggest that our approach is highly efficient and effective in a broad spectrum of network scenarios.
arXiv Detail & Related papers (2024-04-18T09:17:46Z)
- Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping [53.454408491386886]
Bootstrapping self-alignment markedly surpasses the single-round approach.
We propose Step-On-Feet Tuning (SOFT), which leverages the model's continuously enhanced few-shot ability to boost zero- or one-shot performance.
Based on an easy-to-hard training recipe, we propose SOFT+, which further boosts self-alignment performance.
arXiv Detail & Related papers (2024-02-12T12:30:42Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe to convert static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- ON-DEMAND-FL: A Dynamic and Efficient Multi-Criteria Federated Learning Client Deployment Scheme [37.099990745974196]
We introduce On-Demand-FL, a client deployment approach for federated learning.
We make use of containerization technology such as Docker to build efficient environments.
A genetic algorithm (GA) is used to solve the multi-objective optimization problem.
arXiv Detail & Related papers (2022-11-05T13:41:19Z)
- FedGradNorm: Personalized Federated Gradient-Normalized Multi-Task Learning [50.756991828015316]
Multi-task learning (MTL) is a novel framework to learn several tasks simultaneously with a single shared network.
We propose FedGradNorm, which uses a dynamic-weighting method to normalize gradient norms in order to balance learning speeds among different tasks.
arXiv Detail & Related papers (2022-03-24T17:43:12Z)
- Aggregation Service for Federated Learning: An Efficient, Secure, and More Resilient Realization [22.61730495802799]
We present a system design which offers efficient protection of individual model updates throughout the learning procedure.
Our system achieves accuracy comparable to the baseline, with practical performance.
arXiv Detail & Related papers (2022-02-04T05:03:46Z)
- Hyperparameter-free Continuous Learning for Domain Classification in Natural Language Understanding [60.226644697970116]
Domain classification is the fundamental task in natural language understanding (NLU).
Most existing continual learning approaches suffer from low accuracy and performance fluctuation.
We propose a hyperparameter-free continual learning model for text data that can stably produce high performance under various environments.
arXiv Detail & Related papers (2022-01-05T02:46:16Z)
- Efficient Feature Transformations for Discriminative and Generative Continual Learning [98.10425163678082]
We propose a simple task-specific feature map transformation strategy for continual learning.
These provide powerful flexibility for learning new tasks, achieved with minimal parameters added to the base architecture.
We demonstrate the efficacy and efficiency of our method with an extensive set of experiments in discriminative (CIFAR-100 and ImageNet-1K) and generative sequences of tasks.
arXiv Detail & Related papers (2021-03-25T01:48:14Z)
- Effective Elastic Scaling of Deep Learning Workloads [3.345876096131764]
We examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms.
We propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization.
arXiv Detail & Related papers (2020-06-24T17:01:09Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.