Pathway: a fast and flexible unified stream data processing framework
for analytical and Machine Learning applications
- URL: http://arxiv.org/abs/2307.13116v1
- Date: Wed, 12 Jul 2023 08:27:37 GMT
- Title: Pathway: a fast and flexible unified stream data processing framework
for analytical and Machine Learning applications
- Authors: Michal Bartoszkiewicz, Jan Chorowski, Adrian Kosowski, Jakub Kowalski,
Sergey Kulik, Mateusz Lewandowski, Krzysztof Nowicki, Kamil Piechowiak,
Olivier Ruas, Zuzanna Stamirowska, Przemyslaw Uznanski
- Abstract summary: Pathway is a new unified data processing framework that can run workloads on both bounded and unbounded data streams.
We describe the system and present benchmarking results which demonstrate its capabilities in both batch and streaming contexts.
- Score: 7.850979932441607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Pathway, a new unified data processing framework that can run
workloads on both bounded and unbounded data streams. The framework was created
with the original motivation of resolving challenges faced when analyzing and
processing data from the physical economy, including streams of data generated
by IoT and enterprise systems. These required rapid reaction while calling for
the application of advanced computation paradigms (machine learning-powered
analytics, contextual analysis, and other elements of complex event
processing). Pathway is equipped with a Table API tailored for Python and
Python/SQL workflows, and is powered by a distributed incremental dataflow in
Rust. We describe the system and present benchmarking results which demonstrate
its capabilities in both batch and streaming contexts, where it is able to
surpass state-of-the-art industry frameworks in both scenarios. We also discuss
streaming use cases handled by Pathway which cannot be easily resolved with
state-of-the-art industry frameworks, such as streaming iterative graph
algorithms (PageRank, etc.).
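The idea of one incremental computation serving both bounded and unbounded inputs can be illustrated with a minimal sketch. This is plain Python and does not use Pathway's API; the update format `(key, value, diff)`, the function names `incremental_sum` and `final_state`, and the sensor readings are all assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Assumed update format: (key, value, diff).
# diff=+1 inserts a row, diff=-1 retracts a previously inserted row.
Update = Tuple[str, float, int]


def incremental_sum(updates: Iterable[Update]) -> Iterator[Tuple[str, float]]:
    """Yield (key, running_sum) after every update, touching only the changed key."""
    totals: Dict[str, float] = defaultdict(float)
    for key, value, diff in updates:
        totals[key] += diff * value
        yield key, totals[key]


def final_state(updates: Iterable[Update]) -> Dict[str, float]:
    """Batch mode: consume a bounded input and keep only the last sum per key."""
    state: Dict[str, float] = {}
    for key, total in incremental_sum(updates):
        state[key] = total
    return state


# Hypothetical sensor readings; the last update retracts an earlier value.
readings = [
    ("sensor_a", 2.0, +1),
    ("sensor_b", 5.0, +1),
    ("sensor_a", 3.0, +1),
    ("sensor_a", 2.0, -1),  # retract the first sensor_a reading
]

streamed = list(incremental_sum(readings))  # streaming view: one output per update
batch = final_state(readings)               # batch view: final totals only
```

The same `incremental_sum` logic drives both views: a streaming consumer reads each intermediate result as it is produced, while a batch consumer only inspects the state once the bounded input is exhausted.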
Related papers
- Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.
We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms.
We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
- A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks [1.3398445165628463]
This paper provides a comprehensive analysis of fault recovery performance, stability, and recovery time in a cloud-native environment.
Our results indicate that Flink is the most stable and has one of the best fault recovery performances.
Kafka Streams shows suitable fault recovery performance and stability, but with higher event latency.
arXiv Detail & Related papers (2024-04-09T10:49:23Z)
- ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with
Distributed Stream Processing Frameworks [1.4374467687356276]
This paper introduces ShuffleBench, a novel benchmark to evaluate the performance of modern stream processing frameworks.
ShuffleBench is inspired by requirements for near real-time analytics of a large cloud observability platform.
Our results show that Flink achieves the highest throughput while Hazelcast processes data streams with the lowest latency.
arXiv Detail & Related papers (2024-03-07T15:06:24Z)
- In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Automated Evolutionary Approach for the Design of Composite Machine
Learning Pipelines [48.7576911714538]
The proposed approach aims to automate the design of composite machine learning pipelines.
It designs the pipelines with a customizable graph-based structure, analyzes the obtained results, and reproduces them.
The software implementation of this approach is presented as an open-source framework.
arXiv Detail & Related papers (2021-06-26T23:19:06Z)
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results but in the presence of concept drift, detection or adaptation techniques have to be applied to maintain the predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
- FENXI: Deep-learning Traffic Analytics at the Edge [69.34903175081284]
We present FENXI, a system to run complex analytics by leveraging TPU.
FENXI decouples operations and traffic analytics, which operate at different granularities.
Our analysis shows that FENXI can sustain forwarding line rate traffic processing requiring only limited resources.
arXiv Detail & Related papers (2021-05-25T08:02:44Z)
- A Query Language for Summarizing and Analyzing Business Process Data [6.952242545832663]
We present a framework to model process data as graphs, i.e., Process Graph, and present abstractions to summarize the process graph.
We have implemented a scalable architecture for querying, exploration and analysis of process graphs.
arXiv Detail & Related papers (2021-05-23T11:07:53Z)
- Ranking and benchmarking framework for sampling algorithms on synthetic
data streams [0.0]
In big data, AI, and streaming processing, we work with large amounts of data from multiple sources.
Due to memory and network limitations, we process data streams on distributed systems to alleviate computational and network loads.
We provide algorithms that react to concept drifts and compare those against the state-of-the-art algorithms using our framework.
arXiv Detail & Related papers (2020-06-17T14:25:07Z)