HPTMT: Operator-Based Architecture for Scalable High-Performance
Data-Intensive Frameworks
- URL: http://arxiv.org/abs/2107.12807v1
- Date: Tue, 27 Jul 2021 13:28:34 GMT
- Title: HPTMT: Operator-Based Architecture for Scalable High-Performance
Data-Intensive Frameworks
- Authors: Supun Kamburugamuve, Chathura Widanage, Niranda Perera, Vibhatha
Abeykoon, Ahmet Uyar, Thejaka Amila Kanewala, Gregor von Laszewski, and
Geoffrey Fox
- Abstract summary: High-Performance Tensors, Matrices and Tables (HPTMT) is an operator-based architecture for data-intensive applications, inspired by systems such as MPI, HPF, NumPy, Pandas, Modin, PyTorch, Spark, RAPIDS (NVIDIA), and OneAPI (Intel).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-intensive applications impact many domains, and their steadily
increasing size and complexity demands high-performance, highly usable
environments. We integrate a set of ideas developed in various data science and
data engineering frameworks. They employ a set of operators on specific data
abstractions that include vectors, matrices, tensors, graphs, and tables. Our
key concepts are inspired from systems like MPI, HPF (High-Performance
Fortran), NumPy, Pandas, Spark, Modin, PyTorch, TensorFlow, RAPIDS(NVIDIA), and
OneAPI (Intel). Further, it is crucial to support different languages in
everyday use in the Big Data arena, including Python, R, C++, and Java. We note
the importance of Apache Arrow and Parquet for enabling language agnostic high
performance and interoperability. In this paper, we propose High-Performance
Tensors, Matrices and Tables (HPTMT), an operator-based architecture for
data-intensive applications, and identify the fundamental principles needed for
performance and usability success. We illustrate these principles by a
discussion of examples using our software environments, Cylon and Twister2 that
embody HPTMT.
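The operator-on-table idea at the heart of HPTMT can be illustrated with a minimal, stdlib-only Python sketch: a table abstraction exposing composable relational operators. All names here (`Table`, `select`, `join`) are hypothetical illustrations of the principle, not the actual Cylon or Twister2 API.

```python
# Hedged sketch: a toy columnar table with chainable operators.
# Illustrative only -- not the Cylon/Twister2 API.
from dataclasses import dataclass

@dataclass
class Table:
    columns: dict  # column name -> list of values

    def select(self, predicate):
        """Row-wise filter operator: keep rows where predicate(row) is True."""
        names = list(self.columns)
        rows = zip(*(self.columns[n] for n in names))
        kept = [r for r in rows if predicate(dict(zip(names, r)))]
        return Table({n: [r[i] for r in kept] for i, n in enumerate(names)})

    def join(self, other, on):
        """Hash-join operator on a shared key column."""
        index = {}
        for i, k in enumerate(other.columns[on]):
            index.setdefault(k, []).append(i)
        names_l = list(self.columns)
        names_r = [n for n in other.columns if n != on]
        out = {n: [] for n in names_l + names_r}
        for i, k in enumerate(self.columns[on]):
            for j in index.get(k, []):
                for n in names_l:
                    out[n].append(self.columns[n][i])
                for n in names_r:
                    out[n].append(other.columns[n][j])
        return Table(out)

left = Table({"id": [1, 2, 3], "x": [10, 20, 30]})
right = Table({"id": [2, 3, 4], "y": ["a", "b", "c"]})
result = left.join(right, on="id").select(lambda r: r["x"] > 20)
print(result.columns)  # {'id': [3], 'x': [30], 'y': ['b']}
```

In a real HPTMT-style system these operators would be distributed and backed by a columnar format such as Apache Arrow, which is what enables the language-agnostic interoperability the abstract emphasizes.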
Related papers
- MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark [70.47478110973042]
We introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks. MMTU is designed to comprehensively evaluate models' ability to understand, reason over, and manipulate real tables at the expert level. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models.
arXiv Detail & Related papers (2025-06-05T21:05:03Z) - Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models [64.28420991770382]
We present Data-Juicer 2.0, a new system offering fruitful data processing capabilities backed by over a hundred operators.
The system is publicly available, actively maintained, and broadly adopted in diverse research endeavors, practical applications, and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z) - TensorBank: Tensor Lakehouse for Foundation Model Training [1.8811254972035676]
Streaming and storing high dimensional data for foundation model training became a critical requirement with the rise of foundation models beyond natural language.
We introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries.
This architecture generalizes to other use cases like computer vision, computational neuroscience, biological sequence analysis, and more.
arXiv Detail & Related papers (2023-09-05T10:00:33Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel [19.24542340170026]
We introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training.
FSDP provides support for significantly larger models with near-linear scalability in terms of TFLOPS.
arXiv Detail & Related papers (2023-04-21T23:52:27Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Desbordante: from benchmarking suite to high-performance
science-intensive data profiler (preprint) [36.537985747809245]
Desbordante is a high-performance science-intensive data profiler with open source code.
Unlike similar systems, it is built with emphasis on industrial application in a multi-user environment.
It is efficient, resilient to crashes, and scalable.
arXiv Detail & Related papers (2023-01-14T19:14:51Z) - Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval [60.457378374671656]
Tevatron is a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity.
We show how Tevatron's flexible design enables easy generalization across datasets, model architectures, and accelerator platforms.
arXiv Detail & Related papers (2022-03-11T05:47:45Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - HPTMT Parallel Operators for High Performance Data Science & Data
Engineering [0.0]
The HPTMT architecture identifies a set of data structures, operators, and an execution model for creating rich data applications.
This paper elaborates and illustrates this architecture using an end-to-end application with deep learning and data engineering parts working together.
arXiv Detail & Related papers (2021-08-13T00:05:43Z) - Data Engineering for HPC with Python [0.0]
Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements.
One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications.
We present a distributed Python API based on table abstraction for representing and processing data.
arXiv Detail & Related papers (2020-10-13T11:53:11Z) - KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT).
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z) - Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
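The "one program" idea in the IFAQ entry above can be sketched in a few lines of stdlib-only Python: the feature-extraction aggregation and the learning step share a single program, so the regression is fit directly over the aggregates without materializing an intermediate feature matrix. This is an illustrative toy, not IFAQ's actual Scala DSL; all data and names are made up.

```python
# Toy sketch: fuse a relational aggregation with 1-D least squares
# (y = w*x + b) in one program. Illustrative only -- not IFAQ.
orders = [
    {"customer": "a", "amount": 10.0},
    {"customer": "a", "amount": 14.0},
    {"customer": "b", "amount": 20.0},
    {"customer": "b", "amount": 26.0},
    {"customer": "c", "amount": 30.0},
    {"customer": "c", "amount": 40.0},
]
targets = {"a": 1.0, "b": 2.0, "c": 3.0}  # label per customer

# Aggregation step: total spend per customer (the "feature query").
totals = {}
for row in orders:
    totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]

# Learning step: closed-form least squares computed directly over the
# aggregates -- no intermediate feature matrix is materialized.
n = len(totals)
sx = sum(totals.values())
sy = sum(targets[c] for c in totals)
sxx = sum(v * v for v in totals.values())
sxy = sum(totals[c] * targets[c] for c in totals)
w = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - w * sx) / n
print(w, b)
```

IFAQ's contribution is doing this kind of fusion automatically, with compiler optimizations across the query/learning boundary, which is where its reported speedups over mlpack and Scikit come from.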
This list is automatically generated from the titles and abstracts of the papers on this site.