Singularity: Planet-Scale, Preemptible, Elastic Scheduling of AI
Workloads
- URL: http://arxiv.org/abs/2202.07848v1
- Date: Wed, 16 Feb 2022 04:02:10 GMT
- Title: Singularity: Planet-Scale, Preemptible, Elastic Scheduling of AI
Workloads
- Authors: Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav
Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran
Ramjee, Pankaj Sharma, Atul Katiyar, Vipul Modi, Vaibhav Sharma, Abhishek
Singh, Shreshth Singhal, Kaustubh Welankar, Lu Xun, Ravi Anupindi, Karthik
Elangovan, Hasibur Rahman, Zhou Lin, Rahul Seetharaman, Cheng Xu, Eddie
Ailijiang, Suresh Krishnappa, Mark Russinovich (Microsoft)
- Abstract summary: We present Singularity, Microsoft's globally distributed scheduling service for deep learning training and inference workloads.
At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads.
We show that the resulting efficiency and reliability gains with Singularity are achieved with negligible impact on the steady-state performance.
- Score: 12.117736592836506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lowering costs by driving high utilization across deep learning workloads is
a crucial lever for cloud providers. We present Singularity, Microsoft's
globally distributed scheduling service for highly-efficient and reliable
execution of deep learning training and inference workloads. At the heart of
Singularity is a novel, workload-aware scheduler that can transparently preempt
and elastically scale deep learning workloads to drive high utilization without
impacting their correctness or performance, across a global fleet of AI
accelerators (e.g., GPUs, FPGAs).
All jobs in Singularity are preemptable, migratable, and dynamically
resizable (elastic) by default: a live job can be dynamically and transparently
(a) preempted and migrated to a different set of nodes, cluster, data center or
a region and resumed exactly from the point where the execution was preempted,
and (b) resized (i.e., elastically scaled-up/down) on a varying set of
accelerators of a given type. Our mechanisms are transparent in that they do
not require the user to make any changes to their code or require using any
custom libraries that may limit flexibility. Additionally, our approach
significantly improves the reliability of deep learning workloads. We show that
the resulting efficiency and reliability gains with Singularity are achieved
with negligible impact on the steady-state performance. Finally, our design
approach is agnostic of DNN architectures and handles a variety of parallelism
strategies (e.g., data/pipeline/model parallelism).
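The scheduling semantics described in the abstract (checkpoint-consistent preemption, migration across clusters or regions, and elastic resizing of a live job without loss of progress) can be pictured with a minimal, self-contained sketch. The sketch below is purely illustrative: `Job`, `Scheduler`, `Checkpoint`, and every method name are assumptions made for exposition, not Singularity's actual API, and the real system achieves transparency at the infrastructure level rather than through job cooperation as modeled here.

```python
# Hypothetical sketch of the control flow the abstract describes:
# transparent preempt -> migrate -> resume, plus elastic resize.
# All identifiers below are illustrative assumptions, not Singularity's API.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Checkpoint:
    """Consistent snapshot taken at a step boundary; resume is exact."""
    step: int
    model_state: Dict[str, float]


@dataclass
class Job:
    name: str
    world_size: int                        # accelerators currently backing the job
    step: int = 0
    model_state: Dict[str, float] = field(default_factory=lambda: {"w": 0.0})

    def train_one_step(self) -> None:
        # Stand-in for one synchronized optimizer step; the logical result
        # does not depend on how many accelerators back it.
        self.model_state["w"] += 0.1
        self.step += 1

    def snapshot(self) -> Checkpoint:
        return Checkpoint(self.step, dict(self.model_state))

    def restore(self, ckpt: Checkpoint, world_size: int) -> None:
        # Resuming on a different accelerator count is transparent to the job:
        # the logical training position (step, state) is unchanged.
        self.step, self.model_state = ckpt.step, dict(ckpt.model_state)
        self.world_size = world_size


class Scheduler:
    """Toy global scheduler: every job is preemptible and resizable by default."""

    def __init__(self, regions: Dict[str, int]):
        self.free_gpus = dict(regions)     # region -> free accelerators

    def preempt_and_migrate(self, job: Job, src: str, dst: str,
                            new_world_size: int) -> None:
        ckpt = job.snapshot()              # (1) checkpoint at a step boundary
        self.free_gpus[src] += job.world_size
        assert self.free_gpus[dst] >= new_world_size, "not enough capacity"
        self.free_gpus[dst] -= new_world_size
        job.restore(ckpt, new_world_size)  # (2) resume exactly where it stopped,
                                           #     possibly scaled up/down (elastic)


if __name__ == "__main__":
    sched = Scheduler({"us-west": 8, "eu-north": 16})
    job = Job("bert-pretrain", world_size=8)
    sched.free_gpus["us-west"] -= job.world_size

    for _ in range(100):
        job.train_one_step()

    # Higher-priority work arrives in us-west: preempt, migrate to eu-north,
    # and elastically scale the job up to 16 accelerators of the same type.
    sched.preempt_and_migrate(job, src="us-west", dst="eu-north", new_world_size=16)
    assert job.step == 100                 # resumed from the exact preemption point
    for _ in range(100):
        job.train_one_step()
    print(job.step, job.world_size)        # 200 16
```

The key design point the toy mirrors is that preemption and resizing operate on a consistent checkpoint at a step boundary, so a job always resumes from exactly where it was preempted regardless of where, or on how many accelerators, it lands next.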
Related papers
- Efficient Federated Learning against Heterogeneous and Non-stationary Client Unavailability [23.466997173249034]
FedAPM includes novel structures that (i) compensate for missed computations due to unavailability with only $O(1)$ additional memory and computation with respect to standard FedAvg.
We show that FedAPM converges to a stationary point even under non-stationary availability dynamics.
arXiv Detail & Related papers (2024-09-26T00:38:18Z)
- Context-Aware Orchestration of Energy-Efficient Gossip Learning Schemes [8.382766344930157]
We present a distributed training approach based on the combination of Gossip Learning with adaptive optimization of the learning process.
We propose a data-driven approach to OGL management that relies on optimizing the learning process in real time for each node.
Results suggest that our approach is highly efficient and effective in a broad spectrum of network scenarios.
arXiv Detail & Related papers (2024-04-18T09:17:46Z)
- Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping [53.454408491386886]
Bootstrapping self-alignment markedly surpasses the single-round approach.
We propose Step-On-Feet Tuning (SOFT), which leverages the model's continuously enhanced few-shot ability to boost zero- or one-shot performance.
Based on an easy-to-hard training recipe, we propose SOFT+, which further boosts self-alignment performance.
arXiv Detail & Related papers (2024-02-12T12:30:42Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe to convert static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- ON-DEMAND-FL: A Dynamic and Efficient Multi-Criteria Federated Learning Client Deployment Scheme [37.099990745974196]
We introduce On-Demand-FL, a client deployment approach for federated learning.
We make use of containerization technology such as Docker to build efficient environments.
A genetic algorithm (GA) is used to solve the multi-objective optimization problem.
arXiv Detail & Related papers (2022-11-05T13:41:19Z)
- FedGradNorm: Personalized Federated Gradient-Normalized Multi-Task Learning [50.756991828015316]
Multi-task learning (MTL) is a novel framework to learn several tasks simultaneously with a single shared network.
We propose FedGradNorm, which uses a dynamic-weighting method to normalize gradient norms in order to balance learning speeds among different tasks.
arXiv Detail & Related papers (2022-03-24T17:43:12Z)
- Aggregation Service for Federated Learning: An Efficient, Secure, and More Resilient Realization [22.61730495802799]
We present a system design which offers efficient protection of individual model updates throughout the learning procedure.
Our system achieves accuracy comparable to the baseline, with practical performance.
arXiv Detail & Related papers (2022-02-04T05:03:46Z)
- Hyperparameter-free Continuous Learning for Domain Classification in Natural Language Understanding [60.226644697970116]
Domain classification is the fundamental task in natural language understanding (NLU).
Most existing continual learning approaches suffer from low accuracy and performance fluctuation.
We propose a hyperparameter-free continual learning model for text data that can stably produce high performance under various environments.
arXiv Detail & Related papers (2022-01-05T02:46:16Z)
- Efficient Feature Transformations for Discriminative and Generative Continual Learning [98.10425163678082]
We propose a simple task-specific feature map transformation strategy for continual learning.
These provide powerful flexibility for learning new tasks, achieved with minimal parameters added to the base architecture.
We demonstrate the efficacy and efficiency of our method with an extensive set of experiments in discriminative (CIFAR-100 and ImageNet-1K) and generative sequences of tasks.
arXiv Detail & Related papers (2021-03-25T01:48:14Z)
- Effective Elastic Scaling of Deep Learning Workloads [3.345876096131764]
We examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms.
We propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization.
arXiv Detail & Related papers (2020-06-24T17:01:09Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.