In-depth Analysis On Parallel Processing Patterns for High-Performance
Dataframes
- URL: http://arxiv.org/abs/2307.01394v1
- Date: Mon, 3 Jul 2023 23:11:03 GMT
- Title: In-depth Analysis On Parallel Processing Patterns for High-Performance
Dataframes
- Authors: Niranda Perera, Arup Kumar Sarker, Mills Staylor, Gregor von
Laszewski, Kaiying Shan, Supun Kamburugamuve, Chathura Widanage, Vibhatha
Abeykoon, Thejaka Amila Kanewela, Geoffrey Fox
- Abstract summary: We present a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon.
In this paper, we expand on the initial concept by introducing a cost model for evaluating these patterns.
We evaluate the performance of Cylon on the ORNL Summit supercomputer.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The Data Science domain has expanded monumentally in both research and
industry communities during the past decade, predominantly owing to the Big
Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are
bringing more complexities to data engineering applications, which are now
integrated into data processing pipelines to process terabytes of data.
Typically, a significant amount of time is spent on data preprocessing in these
pipelines, and hence improving its efficiency directly impacts the overall
pipeline performance. The community has recently embraced the concept of
Dataframes as the de-facto data structure for data representation and
manipulation. However, the most widely used serial Dataframes today (R, pandas)
experience performance limitations while working on even moderately large data
sets. We believe that there is plenty of room for improvement by taking a look
at this problem from a high-performance computing point of view. In a prior
publication, we presented a set of parallel processing patterns for distributed
dataframe operators and the reference runtime implementation, Cylon [1]. In
this paper, we expand on the initial concept by introducing a cost model
for evaluating these patterns. Furthermore, we evaluate the performance of
Cylon on the ORNL Summit supercomputer.
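The abstract describes parallel processing patterns for distributed dataframe operators without reproducing Cylon's API. As a conceptual illustration only, one such pattern, shuffling rows by a hash of the join key so that matching keys land on the same worker, followed by a local hash join, can be sketched in plain Python. All function names below are hypothetical and do not reflect Cylon's actual interface:

```python
from collections import defaultdict

def hash_partition(rows, key, n_workers):
    """Assign each row (a dict) to a partition by hashing its join key.

    In a real distributed runtime this corresponds to an all-to-all
    shuffle; here the partitions are just local lists.
    """
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[hash(row[key]) % n_workers].append(row)
    return parts

def local_join(left_rows, right_rows, key):
    """Hash join within a single partition: build an index on the
    right side, then probe it with each left row."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[key]].append(r)
    return [{**l, **r} for l in left_rows for r in index[l[key]]]

def shuffle_join(left, right, key, n_workers=4):
    """Shuffle both sides by key, then join matching partitions.

    Because both sides use the same hash, rows with equal keys always
    end up in the same partition, so each local join is independent.
    """
    left_parts = hash_partition(left, key, n_workers)
    right_parts = hash_partition(right, key, n_workers)
    out = []
    for i in range(n_workers):
        out.extend(local_join(left_parts[i], right_parts[i], key))
    return out
```

The independence of the per-partition joins is what makes the pattern amenable to the kind of cost modeling the paper introduces: total cost decomposes into a communication term (the shuffle) plus the per-worker local-join work.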
Related papers
- Optimizing VarLiNGAM for Scalable and Efficient Time Series Causal Discovery [5.430532390358285]
Causal discovery is designed to identify causal relationships in data.
Time series causal discovery is particularly challenging due to the need to account for temporal dependencies and potential time lag effects.
This study significantly improves the feasibility of processing large datasets.
arXiv Detail & Related papers (2024-09-09T10:52:58Z) - An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z) - D3A-TS: Denoising-Driven Data Augmentation in Time Series [0.0]
This work focuses on studying and analyzing the use of different techniques for data augmentation in time series for classification and regression problems.
The proposed approach involves the use of diffusion probabilistic models, which have recently achieved successful results in the field of Image Processing.
The results highlight the high utility of this methodology in creating synthetic data to train classification and regression models.
arXiv Detail & Related papers (2023-12-09T11:37:07Z) - RINAS: Training with Dataset Shuffling Can Be General and Fast [2.485503195398027]
RINAS is a data loading framework that addresses the performance bottleneck of loading global shuffled datasets.
We implement RINAS under the PyTorch framework for common dataset libraries HuggingFace and TorchVision.
Our experimental results show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively.
arXiv Detail & Related papers (2023-12-04T21:50:08Z) - Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z) - PARTIME: Scalable and Parallel Processing Over Time with Deep Neural
Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning
Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Deep Cellular Recurrent Network for Efficient Analysis of Time-Series
Data with Spatial Information [52.635997570873194]
This work proposes a novel deep cellular recurrent neural network (DCRNN) architecture to process complex multi-dimensional time series data with spatial information.
The proposed architecture achieves state-of-the-art performance while utilizing substantially less trainable parameters when compared to comparable methods in the literature.
arXiv Detail & Related papers (2021-01-12T20:08:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.