Data Engineering for HPC with Python
- URL: http://arxiv.org/abs/2010.06312v1
- Date: Tue, 13 Oct 2020 11:53:11 GMT
- Title: Data Engineering for HPC with Python
- Authors: Vibhatha Abeykoon, Niranda Perera, Chathura Widanage, Supun
Kamburugamuve, Thejaka Amila Kanewala, Hasara Maithree, Pulasthi
Wickramasinghe, Ahmet Uyar and Geoffrey Fox
- Abstract summary: Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements.
One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications.
We present a distributed Python API based on table abstraction for representing and processing data.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data engineering is becoming an increasingly important part of scientific
discoveries with the adoption of deep learning and machine learning. Data
engineering deals with a variety of data formats, storage, data extraction,
transformation, and data movements. One goal of data engineering is to
transform data from original data to vector/matrix/tensor formats accepted by
deep learning and machine learning applications. There are many structures such
as tables, graphs, and trees to represent data in these data engineering
phases. Among them, tables are a versatile and commonly used format to load and
process data. In this paper, we present a distributed Python API based on table
abstraction for representing and processing data. Unlike existing
state-of-the-art data engineering tools written purely in Python, our solution
adopts high performance compute kernels in C++, with an in-memory table
representation with Cython-based Python bindings. In the core system, we use
MPI for distributed memory computations with a data-parallel approach for
processing large datasets in HPC clusters.
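The data-parallel approach mentioned in the abstract can be sketched in plain Python. This is an illustrative assumption, not the paper's API: the function names `hash_partition`, `local_join`, and `data_parallel_join` are hypothetical, and an ordinary loop over partitions stands in for the MPI processes that would each own one partition in a real cluster.

```python
from collections import defaultdict


def hash_partition(rows, key, num_workers):
    """Split rows into num_workers partitions by hashing the key column."""
    partitions = [[] for _ in range(num_workers)]
    for row in rows:
        partitions[hash(row[key]) % num_workers].append(row)
    return partitions


def local_join(left, right, key):
    """Inner hash join of two row lists held by the same worker."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def data_parallel_join(left, right, key, num_workers=4):
    """Partition both tables on the key, then join matching partitions.

    Rows with equal keys always land in the same partition, so each
    partition pair can be joined independently. Here the pairs are
    processed in a loop; in a distributed setting each pair would be
    handled by a separate process (hypothetical mapping, not shown).
    """
    left_parts = hash_partition(left, key, num_workers)
    right_parts = hash_partition(right, key, num_workers)
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        joined.extend(local_join(left_part, right_part, key))
    return joined
```

Because both tables are partitioned with the same hash function, no partition needs rows from any other partition once the shuffle is done, which is what makes the join step embarrassingly parallel.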
Related papers
- Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities.
It supports more critical tasks including data analysis, annotation, and foundation model post-training.
It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes [0.0]
We present a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon.
In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns.
We evaluate the performance of Cylon on the ORNL Summit supercomputer.
arXiv Detail & Related papers (2023-07-03T23:11:03Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
DataLab has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 transformed versions of them.
arXiv Detail & Related papers (2022-02-25T18:32:19Z)
- From Strings to Data Science: a Practical Framework for Automated String Handling [0.4079265319364249]
Many machine learning libraries require that string features be converted to a numerical representation for the models to work as intended.
In this paper, we propose a framework to do so based on best practices, domain knowledge, and novel techniques.
It automatically identifies different types of string features, processes them accordingly, and encodes them into numerical representations.
arXiv Detail & Related papers (2021-11-02T20:09:03Z)
- PTRAIL -- A python package for parallel trajectory data preprocessing [2.348339658768759]
Trajectory data represent a trace of an object that changes its position in space over time.
There is a need for software specifically tailored to preprocess trajectory data.
We propose PTRAIL, a python package offering several trajectory preprocessing steps.
arXiv Detail & Related papers (2021-08-26T20:14:07Z)
- HPTMT: Operator-Based Architecture for Scalable High-Performance Data-Intensive Frameworks [0.0]
High-Performance Matrices and Tables (HPTMT) is an operator-based architecture for data-intensive applications.
HPTMT is inspired by systems like MPI, HPF, NumPy, Pandas, Modin, PyTorch, Spark, RAPIDS (NVIDIA), and OneAPI (Intel).
In this paper, we propose High-Performance Matrices and Tables (HPTMT), an operator-based architecture for data-intensive applications.
arXiv Detail & Related papers (2021-07-27T13:28:34Z)
- giotto-tda: A Topological Data Analysis Toolkit for Machine Learning and Data Exploration [4.8353738137338755]
giotto-tda is a Python library that integrates high-performance topological data analysis with machine learning.
The library's ability to handle various types of data is rooted in a wide range of preprocessing techniques.
arXiv Detail & Related papers (2020-04-06T10:53:57Z)
- PyODDS: An End-to-end Outlier Detection System with Automated Machine Learning [55.32009000204512]
We present PyODDS, an automated end-to-end Python system for Outlier Detection with Database Support.
Specifically, we define the search space in the outlier detection pipeline, and produce a search strategy within the given search space.
It also provides unified interfaces and visualizations for users with or without data science or machine learning background.
arXiv Detail & Related papers (2020-03-12T03:30:30Z)
- OPFython: A Python-Inspired Optimum-Path Forest Classifier [68.8204255655161]
This paper proposes a Python-based Optimum-Path Forest framework, denoted as OPFython.
As OPFython is a Python-based library, it provides a more friendly environment and a faster prototyping workspace than the C language.
arXiv Detail & Related papers (2020-01-28T15:46:19Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)