Performance Analysis of Deep Learning Workloads on a Composable System
- URL: http://arxiv.org/abs/2103.10911v1
- Date: Fri, 19 Mar 2021 17:15:42 GMT
- Title: Performance Analysis of Deep Learning Workloads on a Composable System
- Authors: Kaoutar El Maghraoui and Lorraine M. Herger and Chekuri Choudary and
Kim Tran and Todd Deshane and David Hanson
- Abstract summary: Composable infrastructure is defined as resources, such as compute, storage, accelerators and networking, that are shared in a pool.
This paper details the design of an enterprise composable infrastructure that we have implemented and made available to our partners in the IBM Research AI Hardware Center.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A composable infrastructure is defined as resources, such as compute,
storage, accelerators and networking, that are shared in a pool and that can be
grouped in various configurations to meet application requirements. This
freedom to 'mix and match' resources dynamically allows for experimentation
early in the design cycle, prior to the final architectural design or hardware
implementation of a system. This design offers the flexibility to serve a
variety of workloads and provides a dynamic co-design platform that allows
experiments and measurements to be carried out in a controlled manner. For
instance, key performance bottlenecks can be revealed early in the
experimentation phase, thus avoiding costly and time-consuming mistakes.
Additionally, various system-level topologies can be evaluated when
experimenting with new Systems-on-Chip (SoCs) and new accelerator types. This
paper details the design of an enterprise composable infrastructure that we
have implemented and made available to our partners in the IBM Research AI
Hardware Center (AIHC). Our experimental evaluations on the composable system
give insights into how the system works and quantify the impact of various
resource aggregations and reconfigurations on representative deep learning
benchmarks.
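
To make the abstract's 'mix and match' idea concrete, below is a minimal, purely illustrative Python sketch of a composable resource pool. The ResourcePool class, its compose/release methods, and all capacity numbers are hypothetical and are not part of the paper's system, which composes physical resources (GPUs, storage, NICs) over a hardware fabric rather than software objects.

    # Hypothetical sketch: models composable infrastructure as a shared pool
    # from which logical nodes are carved out and later returned. All names
    # and capacities here are invented for illustration.
    from dataclasses import dataclass, field

    @dataclass
    class ResourcePool:
        gpus: int = 16        # pooled accelerators
        nvme_tb: int = 32     # pooled NVMe storage, in TB
        nics_100g: int = 8    # pooled 100G network interfaces
        allocations: list = field(default_factory=list)

        def compose(self, name: str, gpus: int = 0, nvme_tb: int = 0,
                    nics_100g: int = 0) -> dict:
            # Carve a logical node out of the shared pool; fail if exhausted.
            if gpus > self.gpus or nvme_tb > self.nvme_tb or nics_100g > self.nics_100g:
                raise RuntimeError(f"pool cannot satisfy request for {name}")
            self.gpus -= gpus
            self.nvme_tb -= nvme_tb
            self.nics_100g -= nics_100g
            node = {"name": name, "gpus": gpus, "nvme_tb": nvme_tb,
                    "nics_100g": nics_100g}
            self.allocations.append(node)
            return node

        def release(self, node: dict) -> None:
            # Return a node's resources to the pool, enabling reconfiguration.
            self.gpus += node["gpus"]
            self.nvme_tb += node["nvme_tb"]
            self.nics_100g += node["nics_100g"]
            self.allocations.remove(node)

    pool = ResourcePool()
    train = pool.compose("dl-training", gpus=8, nvme_tb=4, nics_100g=2)  # GPU-heavy node
    prep = pool.compose("data-prep", nvme_tb=16, nics_100g=4)            # storage-heavy node
    pool.release(train)  # give the GPUs back, then try a different topology

The point of the sketch is the lifecycle the paper studies at the hardware level: carve a configuration out of the shared pool, run a workload, then return the resources so a different aggregation can be evaluated.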
Related papers
- From Computation to Consumption: Exploring the Compute-Energy Link for Training and Testing Neural Networks for SED Systems (arXiv, 2024-09-08)
We study several neural network architectures that are key components of sound event detection systems.
We measure the energy consumption for training and testing small to large architectures.
We establish complex relationships between the energy consumption, the number of floating-point operations, the number of parameters, and the GPU/memory utilization.
- Full-stack evaluation of Machine Learning inference workloads for RISC-V systems (arXiv, 2024-05-24)
This study evaluates the performance of a wide array of machine learning workloads on RISC-V architectures using gem5, an open-source architectural simulator.
Leveraging an open-source compilation toolchain based on Multi-Level Intermediate Representation (MLIR), the research presents benchmarking results specifically focused on deep learning inference workloads.
- PEFSL: A deployment Pipeline for Embedded Few-Shot Learning on a FPGA SoC (arXiv, 2024-04-30)
We develop an end-to-end open-source few-shot learning pipeline for object classification on an FPGA system.
We build and deploy a low-power, low-latency demonstrator trained on the MiniImageNet dataset with a dataflow architecture.
The proposed system has a latency of 30 ms while consuming 6.2 W on the PYNQ-Z1 board.
- Multilayer Environment and Toolchain for Holistic NetwOrk Design and Analysis (arXiv, 2023-10-24)
This work analyses in detail the requirements for distributed systems assessment.
Our approach emphasizes setting up and assessing a broader spectrum of distributed systems.
We demonstrate the framework's capabilities to provide valuable insights across various use cases.
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators (arXiv, 2023-05-24)
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
- Distributed intelligence on the Edge-to-Cloud Continuum: A systematic literature review (arXiv, 2022-04-29)
This review aims at providing a comprehensive vision of the main state-of-the-art libraries and frameworks for machine learning and data analytics available today.
The main simulation, emulation, deployment systems, and testbeds for experimental research on the Edge-to-Cloud Continuum available today are also surveyed.
- An Extensible Benchmark Suite for Learning to Simulate Physical Systems (arXiv, 2021-08-09)
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
- Elastic Architecture Search for Diverse Tasks with Different Resources (arXiv, 2021-08-03)
We study the challenging problem of efficient deployment for diverse tasks with different resources, where the resource constraint and the task of interest (a group of classes) are dynamically specified at test time.
Previous NAS approaches seek to design architectures for all classes simultaneously, which may not be optimal for some individual tasks.
We present a novel and general framework, called Elastic Architecture Search (EAS), permitting instant specializations at runtime for diverse tasks with various resource constraints.
- Integrated Benchmarking and Design for Reproducible and Accessible Evaluation of Robotic Agents (arXiv, 2020-09-09)
We describe a new concept for reproducible robotics research that integrates development and benchmarking.
One of the central components of this setup is the Duckietown Autolab, a standardized setup that is itself relatively low-cost and reproducible.
We validate the system by analyzing the repeatability of experiments conducted using the infrastructure and show that there is low variance across different robot hardware and across different remote labs.
- How to Train Your Super-Net: An Analysis of Training Heuristics in Weight-Sharing NAS (arXiv, 2020-03-09)
We show that some commonly-used baselines for super-net training negatively impact the correlation between super-net and stand-alone performance.
Our code and experiments set a strong and reproducible baseline that future works can build on.