StreamBlocks: A compiler for heterogeneous dataflow computing (technical report)
- URL: http://arxiv.org/abs/2107.09333v1
- Date: Tue, 20 Jul 2021 08:46:47 GMT
- Title: StreamBlocks: A compiler for heterogeneous dataflow computing (technical report)
- Authors: Endri Bezati, Mahyar Emami, Jörn Janneck, James Larus
- Abstract summary: This work introduces StreamBlocks, an open-source compiler and runtime that uses the CAL dataflow programming language to partition computations across platforms.
StreamBlocks supports exploring the design space with a profile-guided tool that helps identify the best hardware-software partitions.
- Score: 1.5293427903448022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To increase performance and efficiency, systems use FPGAs as reconfigurable
accelerators. A key challenge in designing these systems is partitioning
computation between processors and an FPGA. An appropriate division of labor
may be difficult to predict in advance and require experiments and
measurements. When an investigation requires rewriting part of the system in a
new language or with a new programming model, its high cost can retard the
study of different configurations. A single-language system with an appropriate
programming model and compiler that targets both platforms simplifies this
exploration to a simple recompile with new compiler directives.
This work introduces StreamBlocks, an open-source compiler and runtime that
uses the CAL dataflow programming language to partition computations across
heterogeneous (CPU/accelerator) platforms. Because of the dataflow model's
semantics and the CAL language, StreamBlocks can exploit both thread
parallelism in multi-core CPUs and the inherent parallelism of FPGAs.
StreamBlocks supports exploring the design space with a profile-guided tool
that helps identify the best hardware-software partitions.
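The idea of profile-guided partition exploration can be illustrated with a minimal sketch. This is not StreamBlocks's tool or algorithm: the actor names, the profile numbers, the channel costs, and the cost model (software actors run serially on one core, hardware actors run concurrently, and each channel that crosses the boundary adds a fixed transfer cost) are all assumptions made for illustration. The sketch enumerates every hardware/software assignment of a small actor network and keeps the one with the lowest estimated time.

```python
from itertools import product

# Hypothetical profile data: estimated time (ms) per actor in software and in hardware.
PROFILE = {
    "source": {"sw": 2.0, "hw": 1.5},
    "filter": {"sw": 9.0, "hw": 1.0},
    "transform": {"sw": 7.0, "hw": 2.0},
    "sink": {"sw": 1.0, "hw": 3.0},
}

# Hypothetical channels: (producer, consumer, cost in ms if the channel crosses the boundary).
CHANNELS = [
    ("source", "filter", 0.5),
    ("filter", "transform", 0.5),
    ("transform", "sink", 0.5),
]


def estimate(assignment):
    """Estimate run time of one partition under the assumed cost model."""
    # Assumption: software actors share one core, so their times add up.
    sw_time = sum(PROFILE[a]["sw"] for a, side in assignment.items() if side == "sw")
    # Assumption: hardware actors run concurrently, so the slowest one dominates.
    hw_time = max(
        (PROFILE[a]["hw"] for a, side in assignment.items() if side == "hw"),
        default=0.0,
    )
    # Each channel whose endpoints sit on different sides pays a transfer cost.
    crossing = sum(c for src, dst, c in CHANNELS if assignment[src] != assignment[dst])
    return max(sw_time, hw_time) + crossing


def best_partition():
    """Exhaustively search all 2^n assignments (only viable for small n)."""
    actors = sorted(PROFILE)
    candidates = (
        dict(zip(actors, sides))
        for sides in product(("sw", "hw"), repeat=len(actors))
    )
    return min(candidates, key=estimate)


if __name__ == "__main__":
    partition = best_partition()
    print(partition, estimate(partition))
```

Exhaustive search grows exponentially with the number of actors, so a practical tool would need a more scalable search strategy and a cost model calibrated against real measurements; the sketch only shows how profile data can drive the choice of partition.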
Related papers
- A High-level Synthesis Toolchain for the Julia Language [1.7995266833057173]
We propose a new MLIR-based compiler toolchain that unifies the development process by automatically compiling kernels written in the Julia programming language into SystemVerilog. Our toolchain supports both dynamic and static scheduling, directly integrates with the AXI4-Stream protocol, and generates vendor-agnostic RTL. This prototype toolchain is able to synthesize a set of signal processing/mathematical benchmarks that can operate at 100MHz on real FPGA devices.
arXiv Detail & Related papers (2025-12-17T18:32:06Z) - Understanding Accelerator Compilers via Performance Profiling [1.1841612917872066]
Accelerator design languages (ADLs) are high-level languages that compile to hardware units. We introduce Petal, a cycle-level tool for understanding how the compiler's decisions affect performance. We show that Petal's cycle-level profiles can identify performance problems in existing designs.
arXiv Detail & Related papers (2025-11-24T22:40:11Z) - Beyond the GPU: The Strategic Role of FPGAs in the Next Wave of AI [0.0]
Field-Programmable Gate Arrays (FPGAs) are a reconfigurable platform that allows mapping AI algorithms directly into device logic. Unlike CPU and GPU architecture, an FPGA can be reconfigured in the field to adapt its physical structure to a specific model. Partial reconfiguration and compilation flows from AI frameworks are shortening the path from prototype to deployment.
arXiv Detail & Related papers (2025-11-04T03:41:42Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z) - Scaling Tractable Probabilistic Circuits: A Systems Perspective [53.76194929291088]
PyJuice is a general implementation design for PCs that improves prior art in several regards. It is 1-2 orders of magnitude faster than existing systems at training large-scale PCs. PyJuice consumes 2-5x less memory, which enables us to train larger models.
arXiv Detail & Related papers (2024-06-02T14:57:00Z) - An approach to performance portability through generic programming [0.0]
This work describes a design approach that allows the integration of low-level and verbose programming tools into high-level generic algorithms based on template meta-programming in C++.
That allows scientific software to be maintainable and efficient in a period of diversifying hardware in HPC.
arXiv Detail & Related papers (2023-11-08T21:54:43Z) - Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z) - Specx: a C++ task-based runtime system for heterogeneous distributed architectures [0.0]
Specx is a task-based runtime system written in modern C++.
arXiv Detail & Related papers (2023-08-30T11:41:30Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) Expressing the computational core using Tensor Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - QParallel: Explicit Parallelism for Programming Quantum Computers [62.10004571940546]
We present a language extension for parallel quantum programming.
QParallel removes ambiguities concerning parallelism in current quantum programming languages.
We introduce a tool that guides programmers in the placement of parallel regions by identifying the subroutines that profit most from parallelization.
arXiv Detail & Related papers (2022-10-07T16:35:16Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - Systolic Computing on GPUs for Productive Performance [2.8064596842326575]
We propose a language and compiler to productively build high-performance systolic arrays that run on GPUs.
A programmer specifies a projection of a dataflow compute onto a linear systolic array, while leaving the detailed implementation of the projection to a compiler.
The compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs.
arXiv Detail & Related papers (2020-10-29T18:49:54Z) - Extending C++ for Heterogeneous Quantum-Classical Computing [56.782064931823015]
qcor is a language extension to C++ and compiler implementation that enables heterogeneous quantum-classical programming, compilation, and execution in a single-source context.
Our work provides a first-of-its-kind C++ compiler enabling high-level quantum kernel (function) expression in a quantum-language manner.
arXiv Detail & Related papers (2020-10-08T12:49:07Z) - HeAT -- a Distributed and GPU-accelerated Tensor Framework for Data Analytics [0.0]
HeAT is an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API.
HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI.
When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.
arXiv Detail & Related papers (2020-07-27T13:33:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.