Bridging the Gap Between Domain-specific Frameworks and Multiple Hardware Devices
- URL: http://arxiv.org/abs/2405.12491v1
- Date: Tue, 21 May 2024 04:24:47 GMT
- Title: Bridging the Gap Between Domain-specific Frameworks and Multiple Hardware Devices
- Authors: Xu Wen, Wanling Gao, Lei Wang, Jianfeng Zhan
- Abstract summary: We propose a systematic methodology that effectively bridges the gap between domain-specific frameworks and multiple hardware devices.
The framework supports deep learning, classical machine learning, and data analysis across X86, ARM, RISC-V, IoT devices, and GPU.
It outperforms existing solutions like scikit-learn, hummingbird, Spark, and pandas, achieving speedups of up to 4.33x.
- Score: 2.9694650164958802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid development of domain-specific frameworks has presented us with a significant challenge: The current approach of implementing solutions on a case-by-case basis incurs a theoretical complexity of O(M*N), thereby increasing the cost of porting applications to different hardware platforms. To address these challenges, we propose a systematic methodology that effectively bridges the gap between domain-specific frameworks and multiple hardware devices, reducing porting complexity to O(M+N). The approach utilizes multi-layer abstractions. Different domain-specific abstractions are employed to represent applications from various domains. These abstractions are then transformed into a unified abstraction, which is subsequently translated into combinations of primitive operators. Finally, these operators are mapped to multiple hardware platforms. The implemented unified framework supports deep learning, classical machine learning, and data analysis across X86, ARM, RISC-V, IoT devices, and GPU. It outperforms existing solutions like scikit-learn, hummingbird, Spark, and pandas, achieving impressive speedups: 1.1x to 3.83x on X86 servers, 1.06x to 4.33x on ARM IoT devices, 1.25x to 3.72x on RISC-V IoT devices, and 1.93x on GPU. The source code is available at https://github.com/BenchCouncil/bridger.git.
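To make the O(M+N) structure concrete, the sketch below shows the layering in miniature: each frontend lowers a domain program to a shared vocabulary of primitive operators, and each backend implements only those primitives. The class and operator names are illustrative assumptions, not Bridger's actual API.

```python
# Minimal sketch of the O(M+N) layering; names are illustrative assumptions,
# not the actual Bridger API.
import numpy as np

class X86Backend:
    """One implementation per hardware target (N total): implements only
    the primitive-operator vocabulary."""
    def matmul(self, a, b):
        return np.dot(a, b)          # a real backend would call x86 kernels
    def map(self, f, x):
        return f(x)                  # elementwise operator

class SklearnLikeFrontend:
    """One lowering per domain framework (M total)."""
    def lower(self, weights):
        # A linear classifier lowers to: matmul, then an elementwise sigmoid.
        def program(backend, x):
            z = backend.matmul(x, weights)
            return backend.map(lambda v: 1.0 / (1.0 + np.exp(-v)), z)
        return program

# Adding a framework adds one frontend; adding a device adds one backend:
# M + N components instead of M * N hand-written ports.
x = np.array([[1.0, 2.0]])
w = np.array([[0.5], [-0.25]])
predict = SklearnLikeFrontend().lower(w)
print(predict(X86Backend(), x))      # the same program runs on any backend
```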
Related papers
- Designing and Implementing a Generator Framework for a SIMD Abstraction Library [53.84310825081338]
We present TSLGen, a novel end-to-end framework for generating a SIMD abstraction library.
We show that our framework is comparable to existing libraries and achieves the same performance.
arXiv Detail & Related papers (2024-07-26T13:25:38Z)
- Multi-objective Differentiable Neural Architecture Search [58.67218773054753]
We propose a novel NAS algorithm that encodes user preferences for the trade-off between performance and hardware metrics.
Our method outperforms existing MOO NAS methods across a broad range of qualitatively different search spaces and datasets.
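As a rough illustration of what encoding a user preference over objectives can look like, the sketch below uses a simple weighted scalarization; the paper's actual NAS algorithm is differentiable and more sophisticated, so treat this as a toy only.

```python
# Sketch of preference-conditioned scalarization for multi-objective NAS:
# the user's trade-off between accuracy and a hardware cost is encoded as a
# weight vector. Generic illustration, not the paper's actual algorithm.

def scalarize(accuracy, norm_latency, preference):
    """Higher is better. Objectives are assumed normalized to [0, 1];
    latency is a cost, so it enters with a negative sign."""
    w_acc, w_lat = preference
    return w_acc * accuracy - w_lat * norm_latency

# Two candidate architectures: (accuracy, normalized latency).
candidates = {"big_net": (0.95, 1.0), "small_net": (0.80, 0.3)}

for pref in [(0.9, 0.1), (0.5, 0.5)]:      # accuracy-first vs. balanced
    best = max(candidates, key=lambda k: scalarize(*candidates[k], pref))
    print(pref, "->", best)                # -> big_net, then small_net
```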
arXiv Detail & Related papers (2024-02-28T10:09:04Z) - Distributed Inference and Fine-tuning of Large Language Models Over The
Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
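A toy version of the load-balancing idea, assuming a pipeline of transformer layers split across volunteer devices: assigning layers proportionally to device throughput equalizes per-stage time, which is what bounds pipeline throughput. The heuristic below is illustrative, not the paper's fault-tolerant protocol.

```python
# Toy layer-assignment heuristic: give each device a share of the pipeline
# proportional to its speed, so every stage takes roughly equal time
# (pipeline throughput is bounded by the slowest stage). Illustrative only.

def assign_layers(num_layers, device_throughput):
    """device_throughput: layers/sec each device can serve.
    Returns {device: number_of_pipeline_layers}."""
    total = sum(device_throughput.values())
    shares = {d: round(num_layers * t / total)
              for d, t in device_throughput.items()}
    # Repair rounding drift so every layer is owned by exactly one device.
    drift = num_layers - sum(shares.values())
    fastest = max(device_throughput, key=device_throughput.get)
    shares[fastest] += drift
    return shares

print(assign_layers(32, {"gpu_home": 10.0, "laptop": 3.0, "old_pc": 1.0}))
# -> {'gpu_home': 23, 'laptop': 7, 'old_pc': 2}
```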
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results demonstrating 1.8-4.8x and 1.5-3.6x speedups over CPU and GPU baselines, respectively.
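The nth-order gradient that INR-Arch compiles to a dataflow architecture can be written compactly in software by composing automatic differentiation; a minimal JAX sketch, purely for reference on what is being computed:

```python
# d^n f / dx^n by composing reverse-mode AD n times; reference sketch only,
# not the INR-Arch hardware mapping.
import jax
import jax.numpy as jnp

def nth_order_grad(f, n):
    """Return the nth derivative of a scalar function f."""
    g = f
    for _ in range(n):
        g = jax.grad(g)
    return g

f = lambda x: jnp.sin(x) * x**2      # an implicit-neural-style scalar field
d3f = nth_order_grad(f, 3)
print(d3f(1.0))                      # third derivative at x = 1
```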
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
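A miniature of the two-step decomposition, using a blocked matrix multiply: the tile-level primitive plays the role of a TPP, while the surrounding loops stay separate and tunable. Python/NumPy stands in for the paper's actual C++ framework.

```python
# Step 1: the computational core as a tile-level primitive (a stand-in TPP).
# Step 2: declarative loops around it, tunable without touching the kernel.
import numpy as np

def gemm_tpp(a_block, b_block, c_block):
    """Tile-level microkernel; writes into c_block in place."""
    c_block += a_block @ b_block

def blocked_matmul(A, B, tile=64):
    """Logical loops over tiles, kept separate from the microkernel."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                gemm_tpp(A[i:i+tile, k:k+tile], B[k:k+tile, j:j+tile],
                         C[i:i+tile, j:j+tile])
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B), A @ B)
```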
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- HDCC: A Hyperdimensional Computing compiler for classification on embedded systems and high-performance computing [58.720142291102135]
This work introduces HDCC, the first open-source compiler that translates high-level descriptions of hyperdimensional computing (HDC) classification methods into optimized C code.
HDCC is designed like a modern compiler, featuring an intuitive and descriptive input language, an intermediate representation (IR), and a retargetable backend.
To substantiate these claims, we conducted experiments with HDCC on several of the most popular datasets in the HDC literature.
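For readers new to HDC, a minimal classifier of the kind HDCC compiles looks roughly like the following: random bipolar hypervectors per feature, bundling by summation, and a similarity query against class prototypes. This is a generic HDC sketch, not HDCC's generated code.

```python
# Generic hyperdimensional-computing classifier: bundle feature hypervectors,
# build class prototypes, classify by dot-product similarity.
import numpy as np

D = 10_000                                      # hypervector dimensionality
rng = np.random.default_rng(0)
feature_hvs = rng.choice([-1, 1], size=(16, D)) # one random HV per feature

def encode(feature_ids):
    """Bundle the hypervectors of the active features into one bipolar HV."""
    return np.sign(feature_hvs[feature_ids].sum(axis=0))

def train(examples):
    """Class prototype = bundle of all training encodings for that class."""
    protos = {}
    for feature_ids, label in examples:
        protos.setdefault(label, np.zeros(D))
        protos[label] += encode(feature_ids)
    return {c: np.sign(v) for c, v in protos.items()}

def classify(protos, feature_ids):
    q = encode(feature_ids)
    return max(protos, key=lambda c: np.dot(protos[c], q))  # similarity query

protos = train([([0, 1, 2], "a"), ([1, 2, 3], "a"), ([8, 9, 10], "b")])
print(classify(protos, [0, 2, 3]))   # -> "a"
```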
arXiv Detail & Related papers (2023-04-24T19:16:03Z)
- Towards making the most of NLP-based device mapping optimization for OpenCL kernels [5.6596607119831575]
We extend the work of Cummins et al., namely Deeptune, which tackles the problem of optimal device selection (CPU or GPU) for accelerated OpenCL kernels.
We propose four different models that provide enhanced contextual information of source codes.
Experimental results show that our proposed methodology surpasses that of Cummins et al., providing up to a 4% improvement in prediction accuracy.
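Stripped to its core, the device-mapping task is supervised classification from kernel source text. The toy baseline below (a hypothetical bag-of-tokens model, far simpler than Deeptune or the proposed models) shows the shape of the problem:

```python
# Toy device-mapping baseline: predict CPU vs GPU for an OpenCL kernel from
# its source tokens. Hypothetical data and labels, for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

kernels = [
    "__kernel void vadd(__global float* a) { int i = get_global_id(0); a[i] += 1.0f; }",
    "__kernel void serial(__global float* a) { for (int i = 1; i < 1024; i++) a[i] += a[i-1]; }",
]
labels = ["GPU", "CPU"]   # hypothetical ground truth from profiling

model = make_pipeline(CountVectorizer(token_pattern=r"\w+"),
                      LogisticRegression())
model.fit(kernels, labels)
print(model.predict(
    ["__kernel void scale(__global float* a) { int i = get_global_id(0); a[i] *= 2.0f; }"]
))
```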
arXiv Detail & Related papers (2022-08-30T10:20:55Z)
- Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks [0.38848561367220275]
Convolutional Neural Networks (CNNs) are widely used in deep learning applications, e.g., vision systems and robotics.
Here, we present a model-independent reconfigurable co-processing architecture to accelerate CNNs.
In contrast to existing solutions, we introduce limited-precision 32-bit Q-format fixed-point quantization for arithmetic representations and operations.
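A sketch of what Q-format fixed-point arithmetic involves, assuming a Q16.16 split of the 32 bits (the architecture's actual integer/fraction split may differ):

```python
# 32-bit Q-format fixed point, Q16.16 chosen for illustration: reals are
# stored as integers scaled by 2^16, with saturation at the int32 range.

FRAC_BITS = 16
SCALE = 1 << FRAC_BITS
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def to_q(x: float) -> int:
    """Quantize a real value to Q16.16 with saturation."""
    return max(INT32_MIN, min(INT32_MAX, round(x * SCALE)))

def from_q(q: int) -> float:
    return q / SCALE

def q_mul(a: int, b: int) -> int:
    """Fixed-point multiply: the raw product carries 32 fractional bits,
    so shift back down to 16."""
    return (a * b) >> FRAC_BITS

w, x = to_q(0.75), to_q(-2.5)
print(from_q(q_mul(w, x)))   # ~= 0.75 * -2.5 = -1.875
```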
arXiv Detail & Related papers (2021-08-21T09:50:54Z)
- TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems [5.188829601887422]
Deep learning inference on embedded devices is a burgeoning field with myriad applications because tiny embedded devices are omnipresent.
We introduce TensorFlow Lite Micro (TFLM), an open-source ML inference framework for running deep-learning models on embedded systems.
arXiv Detail & Related papers (2020-10-17T00:44:30Z)
- Efficient Execution of Quantized Deep Learning Models: A Compiler Approach [6.616902691349208]
A growing number of applications implement predictive functions using deep learning models.
Deep learning frameworks such as TFLite, MXNet, and PyTorch enable developers to quantize models with only a small drop in accuracy.
However, these frameworks are not well suited to executing quantized models on a variety of hardware platforms.
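The arithmetic such a compiler must target is affine (scale and zero-point) quantization; a generic int8 sketch of quantize/dequantize, not the paper's compiler pipeline:

```python
# Generic affine quantization of the kind TFLite/MXNet/PyTorch schemes use:
# map floats to uint8 via a scale and zero-point, then reconstruct.
import numpy as np

def quantize(x, num_bits=8):
    qmin, qmax = 0, 2**num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(4).astype(np.float32)
q, s, z = quantize(x)
print(x)
print(dequantize(q, s, z))   # small accuracy drop, 4x smaller tensor
```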
arXiv Detail & Related papers (2020-06-18T01:38:10Z)
- SOL: Effortless Device Support for AI Frameworks without Source Code Changes [1.030051577369649]
We introduce SOL, an AI acceleration middleware that provides a hardware abstraction layer allowing us to transparently support heterogeneous hardware.
As a proof of concept, we implemented SOL for PyTorch with three backends: CPU, GPU and vector processors.
arXiv Detail & Related papers (2020-03-24T07:03:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.