Related papers: Reproducible and Portable Big Data Analytics in the Cloud

Reproducible and Portable Big Data Analytics in the Cloud

URL: http://arxiv.org/abs/2112.09762v1
Date: Fri, 17 Dec 2021 20:52:03 GMT
Title: Reproducible and Portable Big Data Analytics in the Cloud
Authors: Xin Wang, Pei Guo, Xingyan Li, Jianwu Wang, Aryya Gangopadhyay, Carl E. Busart, Jade Freeman
Abstract summary: There are two main difficulties in reproducing big data applications in the cloud. The first is how to automate end-to-end execution of big data analytics in the cloud. The second is an application developed for one cloud, such as AWS or Azure, is difficult to reproduce in another cloud.
Score: 4.948702463455218
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cloud computing has become a major approach to enable reproducible computational experiments because of its support of on-demand hardware and software resource provisioning. Yet there are still two main difficulties in reproducing big data applications in the cloud. The first is how to automate end-to-end execution of big data analytics in the cloud including virtual distributed environment provisioning, network and security group setup, and big data analytics pipeline description and execution. The second is an application developed for one cloud, such as AWS or Azure, is difficult to reproduce in another cloud, a.k.a. vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automatic scalable big data application execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. Based on the approach, we propose and develop an open-source toolkit that supports 1) on-demand distributed hardware and software environment provisioning, 2) automatic data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproducibility of existing executions in the same environment or a different environment. We did extensive experiments on both AWS and Azure using three big data analytics applications that run on a virtual CPU/GPU cluster. Three main behaviors of our toolkit were benchmarked: i) execution overhead ratio for reproducibility support, ii) differences of reproducing the same application on AWS and Azure in terms of execution time, budgetary cost and cost-performance ratio, iii) differences between scale-out and scale-up approach for the same application on AWS and Azure.

Related papers

Intelligent Load Balancing in Cloud Computer Systems [0.0]
This research was to design a strategy to dynamically allocate tasks without overloading Cloud nodes.<n>The project involved extensive experiments run on the University of Westminster HPC cluster.
arXiv Detail & Related papers (2025-09-22T19:39:08Z)
OpenCUA: Open Foundations for Computer-Use Agents [74.61449905487565]
Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs)<n>We propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models.<n>Our end-to-end agent models demonstrate strong performance across CUA benchmarks.
arXiv Detail & Related papers (2025-08-12T17:52:32Z)
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models [64.28420991770382]
We present Data-Juicer 2.0, a new system offering fruitful data processing capabilities backed by over a hundred operators. The system is publicly available, actively maintained, and broadly adopted in diverse research endeavors, practical applications, and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose the first serverless workflow benchmarking suite SeBS-Flow. SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns. We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z)
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering. Spider2-V features real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models on environments distinct from their native development settings. This led to the introduction of interchange formats such as ONNX, which includes its infrastructure, and ONNX, which work as standard formats.
arXiv Detail & Related papers (2024-02-21T09:18:44Z)
Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems [32.18320298895805]
We propose to infer functional clusters of instances, i.e., groups of instances having similar functionalities. We first conduct a pilot study on a large-scale cloud system, Huawei Cloud, demonstrating that instances having similar functionalities share similar communication and resource usage patterns. Motivated by these findings, we formulate the identification of functional clusters as a clustering problem and propose a non-intrusive solution called Prism.
arXiv Detail & Related papers (2023-08-15T08:34:54Z)
A Unified Cloud-Enabled Discrete Event Parallel and Distributed Simulation Architecture [0.7949705607963994]
We present a unified parallel and distributed M&S architecture with enough flexibility to deploy simulations in the Cloud. Our framework is based on the Discrete Event System Specification (DEVS) formalism. The performance of the parallel and distributed framework is tested using the xDEVS M&S tool and the DEVStone benchmark with up to eight computing nodes.
arXiv Detail & Related papers (2023-02-22T09:47:09Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
Dynamic Network-Assisted D2D-Aided Coded Distributed Learning [59.29409589861241]
We propose a novel device-to-device (D2D)-aided coded federated learning method (D2D-CFL) for load balancing across devices. We derive an optimal compression rate for achieving minimum processing time and establish its connection with the convergence time. Our proposed method is beneficial for real-time collaborative applications, where the users continuously generate training data.
arXiv Detail & Related papers (2021-11-26T18:44:59Z)
Reproducible Performance Optimization of Complex Applications on the Edge-to-Cloud Continuum [55.6313942302582]
We propose a methodology to support the optimization of real-life applications on the Edge-to-Cloud Continuum. Our approach relies on a rigorous analysis of possible configurations in a controlled testbed environment to understand their behaviour. Our methodology can be generalized to other applications in the Edge-to-Cloud Continuum.
arXiv Detail & Related papers (2021-08-04T07:35:14Z)
Dynamic Scheduling for Stochastic Edge-Cloud Computing Environments using A3C learning and Residual Recurrent Neural Networks [30.61220416710614]
A-Advantage-Actor-Critic (A3C) learning is known to quickly adapt to dynamic scenarios with less data and Residual Recurrent Neural Network (R2N2) to quickly update model parameters. We use the R2N2 architecture to capture a large number of host and task parameters together with temporal patterns to provide efficient scheduling decisions. Experiments conducted on real-world data set show a significant improvement in terms of energy consumption, response time, ServiceLevelAgreement and running cost by 14.4%, 7.74%, 31.9%, and 4.64%, respectively.
arXiv Detail & Related papers (2020-09-01T13:36:34Z)
AI-based Resource Allocation: Reinforcement Learning for Adaptive Auto-scaling in Serverless Environments [0.0]
Serverless computing has emerged as a compelling new paradigm of cloud computing models in recent years. A common approach among both commercial and open source serverless computing platforms is workload-based auto-scaling. In this paper we investigate the applicability of a reinforcement learning approach to request-based auto-scaling in a serverless framework.
arXiv Detail & Related papers (2020-05-29T06:18:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.