Related papers: Dato: A Task-Based Programming Model for Dataflow Accelerators

Dato: A Task-Based Programming Model for Dataflow Accelerators

URL: http://arxiv.org/abs/2509.06794v1
Date: Mon, 08 Sep 2025 15:22:51 GMT
Title: Dato: A Task-Based Programming Model for Dataflow Accelerators
Authors: Shihan Fang, Hongzheng Chen, Niansong Zhang, Jiajie Li, Han Meng, Adrian Liu, Zhiru Zhang,
Abstract summary: We present Dato, a Python-embedded, task-based programming model for dataflow accelerators.<n>Dato elevates data communication and sharding to first-class type constructs.<n>Dato achieves high performance while significantly reducing the burden of writing optimized code.
Score: 13.87015257740592
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate on-chip streaming to mitigate off-chip bandwidth limitations, existing programming models struggle to harness these capabilities effectively. Low-level interfaces provide fine-grained control but impose significant development overhead, whereas high-level tile-based languages abstract away communication details, restricting optimization and forcing compilers to reconstruct the intended dataflow. We present Dato, a Python-embedded, task-based programming model for dataflow accelerators that elevates data communication and sharding to first-class type constructs. Developers write programs as a graph of tasks connected via explicit stream types, with sharded inputs specified using layout types. These tasks are first mapped virtually onto the accelerator's spatial fabric, and the compiler then generates a physical mapping that respects hardware constraints. Experimental results on both AMD Ryzen AI NPU and Alveo FPGA devices demonstrate that Dato achieves high performance while significantly reducing the burden of writing optimized code. On the NPU, Dato attains up to 84% hardware utilization for GEMM and delivers a 2.81x speedup on attention kernels compared to a state-of-the-art commercial framework. On the FPGA, Dato surpasses leading frameworks in performance when generating custom systolic arrays, achieving 98% of the theoretical peak performance.

Related papers

Continuous-Flow Data-Rate-Aware CNN Inference on FPGA [6.473184145566098]
This work presents a novel approach to designing data-rate-aware, continuous-flow CNN architectures.<n>The proposed approach ensures a high hardware utilization close to 100% by interleaving low data rate signals and sharing hardware units.<n>The results show that a significant amount of arithmetic logic can be saved, which allows implementing complex CNNs like MobileNet on a single FPGA with high throughput.
arXiv Detail & Related papers (2026-01-16T17:27:19Z)
NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference.<n>Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead.<n>The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
TileLang: A Composable Tiled Programming Model for AI Systems [17.240134151647187]
We present TileLang, a generalized tiled programming model for more efficient AI programming.<n> TileLang decouples scheduling space (thread binding, layout, tensorize and pipeline) from dataflow, and encapsulated them as a set of customization annotations and primitives.<n>We conduct comprehensive experiments on commonly-used devices, across numerous experiments, our evaluation shows that TileLang can achieve state-of-the-art performance in key kernels.
arXiv Detail & Related papers (2025-04-24T14:08:49Z)
InTAR: Inter-Task Auto-Reconfigurable Accelerator Design for High Data Volume Variation in DNNs [5.762543012823378]
InTAR is a novel accelerator design methodology for HDV applications on FPGAs.<n>It switches execution patterns automatically with a static schedule determined before circuit design.<n>InTAR achieves a high clock frequency with fewer resources and low reconfiguration time.
arXiv Detail & Related papers (2025-02-12T21:43:51Z)
Code generation and runtime techniques for enabling data-efficient deep learning training on GPUs [8.00550423071637]
This dissertation analyzes data inefficiency in representative deep training tasks, specifically in graph neural networks (GNNs) and large language models (LLMs)<n>It proposes novel runtime and code generation techniques to mitigate these challenges and implements these optimizations seamlessly within the PyTorch stack.
arXiv Detail & Related papers (2024-12-06T03:20:03Z)
HASS: Hardware-Aware Sparsity Search for Dataflow DNN Accelerator [47.66463010685586]
We propose a novel approach to exploit unstructured weights and activations sparsity for dataflow accelerators, using software and hardware co-optimization. We achieve an efficiency improvement ranging from 1.3$times$ to 4.2$times$ compared to existing sparse designs.
arXiv Detail & Related papers (2024-06-05T09:25:18Z)
INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient. We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels. We decompose the kernel development in two steps: 1) Expressing the computational core using Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time. PARTIME starts processing each data sample at the time in which it becomes available from the stream. Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms. We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
Towards High Performance Java-based Deep Learning Frameworks [0.22940141855172028]
Modern cloud services have set the demand for fast and efficient data processing. This demand is common among numerous application domains, such as deep learning, data mining, and computer vision. In this paper we have employed TornadoVM, a state-of-the-art programming framework to transparently accelerate Deep Netts; a Java-based deep learning framework.
arXiv Detail & Related papers (2020-01-13T13:03:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.