Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach
- URL: http://arxiv.org/abs/2408.11635v1
- Date: Wed, 21 Aug 2024 14:05:35 GMT
- Title: Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach
- Authors: Hernan Picatto, Georg Heiler, Peter Klimek
- Abstract summary: This paper introduces a cost-effective and flexible orchestration framework using Dagster.
We demonstrate how Dagster's orchestration capabilities can enhance data processing efficiency, enforce best coding practices, and significantly reduce operational costs.
- Score: 0.10241134756773229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of big data technologies has underscored the need for robust and efficient data processing solutions. Traditional Spark-based Platform-as-a-Service (PaaS) solutions, such as Databricks and Amazon Web Services Elastic MapReduce, provide powerful analytics capabilities but often result in high operational costs and vendor lock-in issues. These platforms, while user-friendly, can lead to significant inefficiencies due to their cost structures and lack of transparent pricing. This paper introduces a cost-effective and flexible orchestration framework using Dagster. Our solution aims to reduce dependency on any single PaaS provider by integrating various Spark execution environments. We demonstrate how Dagster's orchestration capabilities can enhance data processing efficiency, enforce best coding practices, and significantly reduce operational costs. In our implementation, we achieved a 12% performance improvement over EMR and a 40% cost reduction compared to DBR, translating to over 300 euros saved per pipeline run. Our goal is to provide a flexible, developer-controlled computing environment that maintains or improves performance and scalability while mitigating the risks associated with vendor lock-in. The proposed framework supports rapid prototyping and testing, which is essential for continuous development and operational efficiency, contributing to a more sustainable model of large data processing.
Related papers
- Big Data Workload Profiling for Energy-Aware Cloud Resource Management [0.0]
This paper presents a workload-aware and energy-efficient scheduling framework. It profiles utilization, memory demand, and storage I/O behavior to guide virtual machine placement decisions. Results demonstrate consistent energy savings of 15 to 20 percent compared to a baseline scheduler.
arXiv Detail & Related papers (2026-01-17T06:50:51Z) - LeJOT: An Intelligent Job Cost Orchestration Solution for Databricks Platform [28.16213013287002]
We introduce LeJOT, an intelligent job cost orchestration framework for Databricks jobs. LeJOT proactively predicts workload demands, dynamically allocates computing resources, and minimizes costs. We show that LeJOT achieves an average 20% reduction in cloud computing costs within a minute-level scheduling timeframe.
arXiv Detail & Related papers (2025-12-20T08:09:58Z) - DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision [50.89715397781075]
Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing capability for complex tasks. We propose DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating decision-making and execution. We show that DecEx-RAG achieves an average absolute performance improvement of 6.2% across six datasets.
arXiv Detail & Related papers (2025-10-07T08:49:22Z) - Dynamic Speculative Agent Planning [57.630218933994534]
Large language-model-based agents face critical deployment challenges due to prohibitive latency and inference costs. We introduce Dynamic Speculative Planning (DSP), an online reinforcement learning framework that provides lossless acceleration with substantially reduced costs. Experiments on two standard agent benchmarks demonstrate that DSP achieves comparable efficiency to the fastest acceleration method while reducing total cost by 30% and unnecessary cost up to 60%.
arXiv Detail & Related papers (2025-09-02T03:34:36Z) - TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks [6.621120466118939]
Model routing allocates queries to the suitable model, improving system performance while reducing costs. We propose Tag, a training-free model routing method designed to optimize the synergy among multiple large language models (LLMs). Experimental results demonstrate that Tag outperforms 13 baseline methods, increasing the system's accept rate by 6.15% and reducing costs by 17.20%, achieving optimal cost-efficiency.
arXiv Detail & Related papers (2025-06-14T12:17:47Z) - Design and Evaluation of a Microservices Cloud Framework for Online Travel Platforms [1.03590082373586]
This paper analyses and integrates a unique Microservices Cloud Framework designed to support Online Travel Platforms (MCF-OTP). MCF-OTP's main goal is to increase the performance, flexibility, and maintenance of online travel platforms via cloud computing and microservice technologies.
arXiv Detail & Related papers (2025-05-20T15:36:55Z) - From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs [23.253571170594455]
Large Language Models (LLMs) have significantly advanced artificial intelligence. This paper introduces a three-stage cost-efficient end-to-end LLM deployment pipeline. It produces super-tiny online models with enhanced performance and reduced costs.
arXiv Detail & Related papers (2025-04-18T05:25:22Z) - Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models [64.28420991770382]
We present Data-Juicer 2.0, a new system offering fruitful data processing capabilities backed by over a hundred operators.
The system is publicly available, actively maintained, and broadly adopted in diverse research endeavors, practical applications, and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z) - The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving [8.552242818726347]
INFERMAX is an analytical framework that uses inference cost models to compare various schedulers.
Our findings indicate that preempting requests can reduce GPU costs by 30% compared to avoiding preemptions at all.
arXiv Detail & Related papers (2024-11-12T00:10:34Z) - CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines [29.25579967636023]
We introduce CEBench, an open-source toolkit for benchmarking online large language models.
It focuses on the critical trade-offs between expenditure and effectiveness required for LLM deployments.
This capability supports crucial decision-making processes aimed at maximizing effectiveness while minimizing cost impacts.
arXiv Detail & Related papers (2024-06-20T21:36:00Z) - Cost-Sensitive Multi-Fidelity Bayesian Optimization with Transfer of Learning Curve Extrapolation [55.75188191403343]
We introduce utility, which is a function predefined by each user and describes the trade-off between cost and performance of BO.
We validate our algorithm on various LC datasets and find that it outperforms all the previous multi-fidelity BO and transfer-BO baselines we consider.
arXiv Detail & Related papers (2024-05-28T07:38:39Z) - Efficient Architecture Search via Bi-level Data Pruning [70.29970746807882]
This work pioneers an exploration into the critical role of dataset characteristics for DARTS bi-level optimization.
We introduce a new progressive data pruning strategy that utilizes supernet prediction dynamics as the metric.
Comprehensive evaluations on the NAS-Bench-201 search space, DARTS search space, and MobileNet-like search space validate that BDP reduces search costs by over 50%.
arXiv Detail & Related papers (2023-12-21T02:48:44Z) - Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally incurs the update of significant parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z) - Towards General and Efficient Online Tuning for Spark [55.30868031221838]
We present a general and efficient Spark tuning framework that can deal with the three issues simultaneously.
We have implemented this framework as an independent cloud service, and applied it to the data platform in Tencent.
arXiv Detail & Related papers (2023-09-05T02:16:45Z) - VFed-SSD: Towards Practical Vertical Federated Advertising [53.08038962443853]
We propose a semi-supervised split distillation framework VFed-SSD to alleviate the two limitations.
Specifically, we develop a self-supervised task MatchedPair Detection (MPD) to exploit the vertically partitioned unlabeled data.
Our framework provides an efficient federation-enhanced solution for real-time display advertising with minimal deploying cost and significant performance lift.
arXiv Detail & Related papers (2022-05-31T17:45:30Z) - FedDUAP: Federated Learning with Dynamic Update and Adaptive Pruning Using Shared Data on the Server [64.94942635929284]
Federated Learning (FL) suffers from two critical challenges, i.e., limited computational resources and low training efficiency.
We propose a novel FL framework, FedDUAP, to exploit the insensitive data on the server and the decentralized data in edge devices.
By integrating the two original techniques together, our proposed FL model, FedDUAP, significantly outperforms baseline approaches in terms of accuracy (up to 4.8% higher), efficiency (up to 2.8 times faster), and computational cost (up to 61.9% smaller).
arXiv Detail & Related papers (2022-04-25T10:00:00Z) - AdaSplit: Adaptive Trade-offs for Resource-constrained Distributed Deep Learning [18.3841463794885]
Split learning (SL) reduces client compute load by splitting the model training between client and server.
AdaSplit enables efficiently scaling SL to low resource scenarios by reducing bandwidth consumption and improving performance across heterogeneous clients.
arXiv Detail & Related papers (2021-12-02T23:33:15Z) - High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models [18.63017668881868]
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook.
In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.
We demonstrate the capability to train very large DLRMs with up to 12 Trillion parameters and show that we can attain 40X speedup in terms of time to solution over previous systems.
arXiv Detail & Related papers (2021-04-12T02:15:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.