Related papers: StreamBlocks: A compiler for heterogeneous dataflow computing (technical report)

URL: http://arxiv.org/abs/2107.09333v1
Date: Tue, 20 Jul 2021 08:46:47 GMT
Title: StreamBlocks: A compiler for heterogeneous dataflow computing (technical report)
Authors: Endri Bezati, Mahyar Emami, Jörn Janneck, James Larus
Abstract summary: This work introduces StreamBlocks, an open-source compiler and runtime that uses the CAL dataflow programming language to partition computations across platforms. StreamBlocks supports exploring the design space with a profile-guided tool that helps identify the best hardware-software partitions.
Score: 1.5293427903448022
License: http://creativecommons.org/licenses/by/4.0/
Abstract: To increase performance and efficiency, systems use FPGAs as reconfigurable accelerators. A key challenge in designing these systems is partitioning computation between processors and an FPGA. An appropriate division of labor may be difficult to predict in advance and require experiments and measurements. When an investigation requires rewriting part of the system in a new language or with a new programming model, its high cost can retard the study of different configurations. A single-language system with an appropriate programming model and compiler that targets both platforms simplifies this exploration to a simple recompile with new compiler directives. This work introduces StreamBlocks, an open-source compiler and runtime that uses the CAL dataflow programming language to partition computations across heterogeneous (CPU/accelerator) platforms. Because of the dataflow model's semantics and the CAL language, StreamBlocks can exploit both thread parallelism in multi-core CPUs and the inherent parallelism of FPGAs. StreamBlocks supports exploring the design space with a profile-guided tool that helps identify the best hardware-software partitions.

Related papers

NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
An approach to performance portability through generic programming [0.0]
This work describes a design approach that allows the integration of low-level and verbose programming tools into high-level generic algorithms based on template meta-programming in C++. That allows scientific software to be maintainable and efficient in a period of diversifying hardware in HPC.
arXiv Detail & Related papers (2023-11-08T21:54:43Z)
Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
Specx: a C++ task-based runtime system for heterogeneous distributed architectures [0.0]
We present Specx, a task-based runtime system written in modern C++.
arXiv Detail & Related papers (2023-08-30T11:41:30Z)
INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient. We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels. We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
QParallel: Explicit Parallelism for Programming Quantum Computers [62.10004571940546]
We present a language extension for parallel quantum programming. QParallel removes ambiguities concerning parallelism in current quantum programming languages. We introduce a tool that guides programmers in the placement of parallel regions by identifying the subroutines that profit most from parallelization.
arXiv Detail & Related papers (2022-10-07T16:35:16Z)
Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms. We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how the resulting summaries help steer this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
Systolic Computing on GPUs for Productive Performance [2.8064596842326575]
We propose a language and compiler to productively build high-performance systolic arrays that run on GPUs. A programmer specifies a projection of a dataflow compute onto a linear systolic array, while leaving the detailed implementation of the projection to a compiler. The compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs.
arXiv Detail & Related papers (2020-10-29T18:49:54Z)
Extending C++ for Heterogeneous Quantum-Classical Computing [56.782064931823015]
qcor is a language extension to C++ and compiler implementation that enables heterogeneous quantum-classical programming, compilation, and execution in a single-source context. Our work provides a first-of-its-kind C++ compiler enabling high-level quantum kernel (function) expression in a quantum-language manner.
arXiv Detail & Related papers (2020-10-08T12:49:07Z)
HeAT -- a Distributed and GPU-accelerated Tensor Framework for Data Analytics [0.0]
HeAT is an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API. HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI. When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.
arXiv Detail & Related papers (2020-07-27T13:33:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.