LeJOT: An Intelligent Job Cost Orchestration Solution for Databricks Platform
- URL: http://arxiv.org/abs/2512.18266v1
- Date: Sat, 20 Dec 2025 08:09:58 GMT
- Title: LeJOT: An Intelligent Job Cost Orchestration Solution for Databricks Platform
- Authors: Lizhi Ma, Yi-Xiang Hu, Yuke Wang, Yifang Zhao, Yihui Ren, Jian-Xiang Liao, Feng Wu, Xiang-Yang Li
- Abstract summary: We introduce LeJOT, an intelligent job cost orchestration framework for Databricks jobs. LeJOT proactively predicts workload demands, dynamically allocates computing resources, and minimizes costs. We show that LeJOT achieves an average 20% reduction in cloud computing costs within a minute-level scheduling timeframe.
- Score: 28.16213013287002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid advancements in big data technologies, the Databricks platform has become a cornerstone for enterprises and research institutions, offering high computational efficiency and a robust ecosystem. However, managing the escalating operational costs associated with job execution remains a critical challenge. Existing solutions rely on static configurations or reactive adjustments, which fail to adapt to the dynamic nature of workloads. To address this, we introduce LeJOT, an intelligent job cost orchestration framework that leverages machine learning for execution time prediction and a solver-based optimization model for real-time resource allocation. Unlike conventional scheduling techniques, LeJOT proactively predicts workload demands, dynamically allocates computing resources, and minimizes costs while ensuring performance requirements are met. Experimental results on real-world Databricks workloads demonstrate that LeJOT achieves an average 20% reduction in cloud computing costs within a minute-level scheduling timeframe, outperforming traditional static allocation strategies. Our approach provides a scalable and adaptive solution for cost-efficient job scheduling in Data Lakehouse environments.
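The predict-then-optimize idea in the abstract can be illustrated with a minimal sketch. It is not LeJOT's implementation: the runtime formula stands in for a learned predictor, exhaustive search stands in for a solver, and the names `ClusterConfig`, `predict_runtime_hours`, `cheapest_feasible`, and all prices and sizes are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical cluster option; not a Databricks API object."""
    name: str
    workers: int
    price_per_worker_hour: float  # assumed illustrative price


def predict_runtime_hours(job_size_gb: float, workers: int) -> float:
    """Stand-in for a learned execution-time predictor (assumed form)."""
    startup_hours = 0.05  # fixed overhead that does not shrink with workers
    return startup_hours + job_size_gb / (100.0 * workers)


def job_cost(job_size_gb: float, cfg: ClusterConfig) -> float:
    """Predicted cost = predicted runtime x workers x price per worker-hour."""
    runtime = predict_runtime_hours(job_size_gb, cfg.workers)
    return runtime * cfg.workers * cfg.price_per_worker_hour


def cheapest_feasible(
    job_size_gb: float,
    deadline_hours: float,
    configs: Sequence[ClusterConfig],
) -> Optional[ClusterConfig]:
    """Pick the lowest-cost config whose predicted runtime meets the deadline.

    Exhaustive search is used here in place of a real optimization solver.
    """
    feasible = [
        c for c in configs
        if predict_runtime_hours(job_size_gb, c.workers) <= deadline_hours
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: job_cost(job_size_gb, c))


CONFIGS = [
    ClusterConfig("small", workers=2, price_per_worker_hour=0.5),
    ClusterConfig("medium", workers=4, price_per_worker_hour=0.5),
    ClusterConfig("large", workers=8, price_per_worker_hour=0.5),
]

if __name__ == "__main__":
    choice = cheapest_feasible(400.0, deadline_hours=1.5, configs=CONFIGS)
    print(choice.name if choice else "no feasible configuration")
```

Because of the fixed startup overhead in this toy model, larger clusters finish sooner but cost slightly more, so the performance requirement determines which configuration is cheapest.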
Related papers
- Attention-Informed Surrogates for Navigating Power-Performance Trade-offs in HPC [0.5219568203653523]
We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework to automate this complex decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry can capture performance dynamics more effectively than standard regression techniques. To our knowledge, this is the first work to successfully apply embedding-informed surrogates in a MOBO framework to the HPC scheduling problem.
arXiv Detail & Related papers (2026-01-21T19:11:12Z) - Evaluating the Efficacy of LLM-Based Reasoning for Multiobjective HPC Job Scheduling [6.375075345747834]
We present a Large Language Model (LLM)-based scheduler using a ReAct-style framework (Reason + Act). The system incorporates a scratchpad memory to track scheduling history and refine decisions via natural language feedback. We evaluate our approach using OpenAI's O4-Mini and Anthropic's Claude 3.7 across seven real-world HPC workload scenarios.
arXiv Detail & Related papers (2025-05-29T14:25:29Z) - Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey [58.50944604905037]
Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems. This survey provides a structured tutorial on fundamental architectures, enabling technologies, and emerging applications.
arXiv Detail & Related papers (2025-05-03T13:55:38Z) - Dynamic Scheduling Strategies for Resource Optimization in Computing Environments [0.29008108937701327]
This paper proposes a container scheduling method based on multi-objective optimization, which aims to balance key performance indicators such as resource utilization, load balancing, and task completion efficiency. The experimental results show that, compared with traditional static rule algorithms and efficiency algorithms, the optimized scheduling scheme has significant advantages in resource utilization, load balancing, and burst task completion.
arXiv Detail & Related papers (2024-12-23T05:43:17Z) - Reinforcement Learning for Adaptive Resource Scheduling in Complex System Environments [8.315191578007857]
This study presents a novel computer system performance optimization and adaptive workload management scheduling algorithm based on Q-learning.
By contrast, Q-learning, a reinforcement learning algorithm, continuously learns from system state changes, enabling dynamic scheduling and resource optimization.
This research provides a foundation for the integration of AI-driven adaptive scheduling in future large-scale systems, offering a scalable, intelligent solution to enhance system performance, reduce operating costs, and support sustainable energy consumption.
arXiv Detail & Related papers (2024-11-08T05:58:09Z) - Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror the multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z) - Dynamic Scheduling for Federated Edge Learning with Streaming Data [56.91063444859008]
We consider a Federated Edge Learning (FEEL) system where training data are randomly generated over time at a set of distributed edge devices with long-term energy constraints.
Due to limited communication resources and latency requirements, only a subset of devices is scheduled for participating in the local training process in every iteration.
arXiv Detail & Related papers (2023-05-02T07:41:16Z) - Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z) - CILP: Co-simulation based Imitation Learner for Dynamic Resource Provisioning in Cloud Computing Environments [13.864161788250856]
A key challenge for latency-critical tasks is to predict future workload demands so that resources can be provisioned proactively.
Existing AI-based solutions tend to not holistically consider all crucial aspects such as provision overheads, heterogeneous VM costs and Quality of Service (QoS) of the cloud system.
We propose a novel method, called CILP, that formulates the VM provisioning problem as two sub-problems of prediction and optimization.
arXiv Detail & Related papers (2023-02-11T09:15:34Z) - Actively Learning Costly Reward Functions for Reinforcement Learning [56.34005280792013]
We show that it is possible to train agents in complex real-world environments orders of magnitude faster.
By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions.
arXiv Detail & Related papers (2022-11-23T19:17:20Z) - Innovations in the field of on-board scheduling technologies [64.41511459132334]
This paper proposes an onboard scheduler that integrates into an onboard software framework for mission autonomy.
The scheduler is based on linear integer programming and relies on the use of a branch-and-cut solver.
The technology has been tested on an Earth Observation scenario, comparing its performance against the state-of-the-art scheduling technology.
arXiv Detail & Related papers (2022-05-04T12:00:49Z) - MCDS: AI Augmented Workflow Scheduling in Mobile Edge Cloud Computing Systems [12.215537834860699]
Recently proposed scheduling methods leverage the low response times of edge computing platforms to optimize application Quality of Service (QoS).
We propose MCDS: Monte Carlo Learning using Deep Surrogate Models to efficiently schedule workflow applications in mobile edge-cloud computing systems.
arXiv Detail & Related papers (2021-12-14T10:00:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.