Related papers: Sequence-to-sequence models for workload interference

Sequence-to-sequence models for workload interference

URL: http://arxiv.org/abs/2006.14429v2
Date: Mon, 6 Jul 2020 14:23:49 GMT
Title: Sequence-to-sequence models for workload interference
Authors: David Buchaca Prats, Joan Marcual, Josep Llu\'is Berral, David Carrera
Abstract summary: Co-scheduling of jobs in data-centers is a challenging scenario, where jobs can compete for resources yielding to severe slowdowns or failed executions. Current techniques, most of them already involving machine learning and job modeling, are based on workload behavior summarization across time. We propose a methodology for modeling co-scheduling of jobs on data-centers, based on their behavior towards resources and execution time.
Score: 1.988145627448243
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Co-scheduling of jobs in data-centers is a challenging scenario, where jobs can compete for resources yielding to severe slowdowns or failed executions. Efficient job placement on environments where resources are shared requires awareness on how jobs interfere during execution, to go far beyond ineffective resource overbooking techniques. Current techniques, most of them already involving machine learning and job modeling, are based on workload behavior summarization across time, instead of focusing on effective job requirements at each instant of the execution. In this work we propose a methodology for modeling co-scheduling of jobs on data-centers, based on their behavior towards resources and execution time, using sequence-to-sequence models based on recurrent neural networks. The goal is to forecast co-executed jobs footprint on resources along their execution time, from the profile shown by the individual jobs, to enhance resource managers and schedulers placement decisions. The methods here presented are validated using High Performance Computing benchmarks based on different frameworks (like Hadoop and Spark) and applications (CPU bound, IO bound, machine learning, SQL queries...). Experiments show that the model can correctly identify the resource usage trends from previously seen and even unseen co-scheduled jobs.

Related papers

Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms. We observe that the generated can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
Venn: Resource Management for Collaborative Learning Jobs [24.596584073531886]
Collaborative learning (CL) has emerged as a promising approach for machine learning (ML) and data science across distributed edge devices. In this paper, we present Venn, a CL resource manager that efficiently schedules heterogeneous devices among multiple CL jobs. Our evaluation shows that, compared to the state-of-the-art CL resource managers, Venn improves the average JCT by up to 1.88x.
arXiv Detail & Related papers (2023-12-13T17:13:08Z)
A Memetic Algorithm with Reinforcement Learning for Sociotechnical Production Scheduling [0.0]
This article presents a memetic algorithm with applying deep reinforcement learning (DRL) to flexible job shop scheduling problems (DRC-FJSSP) From research projects in industry, we recognize the need to consider flexible machines, flexible human workers, worker capabilities, setup and processing operations, material arrival times, complex job paths with parallel tasks for bill of material manufacturing, sequence-dependent setup times and (partially) automated tasks in human-machine-collaboration.
arXiv Detail & Related papers (2022-12-21T11:24:32Z)
Asynchronous Execution of Heterogeneous Tasks in ML-driven HPC Workflows [1.376408511310322]
Asynchronous execution is crucial to improve resource utilization, task throughput and reduce' makespan. We investigate the requirements and properties of the asynchronous task execution of machine learning (ML)-driven high performance computing. Our experiments represent relevant scientific drivers, we perform them at scale on Summit, and we show that the performance enhancements due to asynchronous execution are consistent with our model.
arXiv Detail & Related papers (2022-08-23T16:25:48Z)
Concepts and Algorithms for Agent-based Decentralized and Integrated Scheduling of Production and Auxiliary Processes [78.120734120667]
This paper describes an agent-based decentralized and integrated scheduling approach. Part of the requirements is to develop a linearly scaling communication architecture. The approach is explained using an example based on industrial requirements.
arXiv Detail & Related papers (2022-05-06T18:44:29Z)
Active Multi-Task Representation Learning [50.13453053304159]
We give the first formal study on resource task sampling by leveraging the techniques from active learning. We propose an algorithm that iteratively estimates the relevance of each source task to the target task and samples from each source task based on the estimated relevance.
arXiv Detail & Related papers (2022-02-02T08:23:24Z)
On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds [0.0]
We propose a collaborative approach for sharing anonymized workload execution traces among users. We mining them for general patterns, and exploiting clusters of historical workloads for future optimizations.
arXiv Detail & Related papers (2021-11-16T20:11:36Z)
Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation [52.9168275057997]
This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs. We show that Enel is able to identify effective rescaling actions, reacting for instance to node failures, and can be reused across different execution contexts.
arXiv Detail & Related papers (2021-08-27T10:21:08Z)
Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across Contexts [52.9168275057997]
This paper presents Bellamy, a novel modeling approach that combines scale-outs, dataset sizes, and runtimes with additional descriptive properties of a dataflow job. We evaluate our approach on two publicly available datasets consisting of execution data from various dataflow jobs carried out in different environments.
arXiv Detail & Related papers (2021-07-29T11:57:38Z)
Exploring Relational Context for Multi-Task Dense Prediction [76.86090370115]
We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads. We explore various attention-based contexts, such as global and local, in the multi-task setting. We propose an Adaptive Task-Relational Context module, which samples the pool of all available contexts for each task pair.
arXiv Detail & Related papers (2021-04-28T16:45:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.