Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing
System
- URL: http://arxiv.org/abs/2004.10908v4
- Date: Mon, 6 Sep 2021 18:36:40 GMT
- Title: Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing
System
- Authors: Tsung-Wei Huang, Dian-Lun Lin, Chun-Xun Lin, and Yibo Lin
- Abstract summary: Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task graph-based approach.
Our programming model distinguishes itself as a very general class of task graph parallelism with in-graph control flow.
We have demonstrated the promising performance of Taskflow in real-world applications.
- Score: 12.813275501138193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Taskflow aims to streamline the building of parallel and heterogeneous
applications using a lightweight task graph-based approach. Taskflow introduces
an expressive task graph programming model to assist developers in the
implementation of parallel and heterogeneous decomposition strategies on a
heterogeneous computing platform. Our programming model distinguishes itself as
a very general class of task graph parallelism with in-graph control flow to
enable end-to-end parallel optimization. To support our model with high
performance, we design an efficient system runtime that solves many of the new
scheduling challenges arising out of our models and optimizes the performance
across latency, energy efficiency, and throughput. We have demonstrated the
promising performance of Taskflow in real-world applications. As an example,
Taskflow solves a large-scale machine learning workload up to 29% faster, 1.5x
less memory, and 1.9x higher throughput than the industrial system, oneTBB, on
a machine of 40 CPUs and 4 GPUs. We have opened the source of Taskflow and
deployed it to large numbers of users in the open-source community.
Related papers
- Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.
We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms.
We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
- Specx: a C++ task-based runtime system for heterogeneous distributed architectures [0.0]
Specx is a task-based runtime system written in modern C++.
arXiv Detail & Related papers (2023-08-30T11:41:30Z)
- Automatic Task Parallelization of Dataflow Graphs in ML/DL models [0.0]
We present a Linear Clustering approach to exploit inherent parallel paths in ML dataflow graphs.
We generate readable and executable parallel Pytorch+Python code from input ML models in ONNX format.
Preliminary results on several ML graphs demonstrate up to 1.9x speedup over serial execution.
arXiv Detail & Related papers (2023-08-22T04:54:30Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- A heuristic method for data allocation and task scheduling on heterogeneous multiprocessor systems under memory constraints [14.681986126866452]
This paper focuses on the data allocation and task scheduling problem under memory constraints.
We propose a tabu search algorithm (TS) which combines several distinguished features.
Experimental results show that the proposed TS algorithm can obtain relatively high-quality solutions in a reasonable computational time.
arXiv Detail & Related papers (2022-05-09T10:46:08Z)
- Arch-Graph: Acyclic Architecture Relation Predictor for Task-Transferable Neural Architecture Search [96.31315520244605]
Arch-Graph is a transferable NAS method that predicts task-specific optimal architectures.
We show Arch-Graph's transferability and high sample efficiency across numerous tasks.
It is able to find top 0.16% and 0.29% architectures on average on two search spaces under the budget of only 50 models.
arXiv Detail & Related papers (2022-04-12T16:46:06Z)
- HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections [96.64246471034195]
We propose HyperGrid, a new approach for highly effective multi-task learning.
Our method helps bridge the gap between fine-tuning and multi-task learning approaches.
arXiv Detail & Related papers (2020-07-12T02:49:16Z)
- Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization on several real datasets.
arXiv Detail & Related papers (2020-04-19T05:21:20Z)
- MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++, which outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
- Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach [16.702537371391053]
This article presents an automatic approach to derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures.
Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration.
Compared to the single-stream version, our approach achieves a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively.
arXiv Detail & Related papers (2020-03-05T21:18:21Z)