Towards Deterministic End-to-end Latency for Medical AI Systems in
NVIDIA Holoscan
- URL: http://arxiv.org/abs/2402.04466v1
- Date: Tue, 6 Feb 2024 23:20:34 GMT
- Title: Towards Deterministic End-to-end Latency for Medical AI Systems in
NVIDIA Holoscan
- Authors: Soham Sinha, Shekhar Dwivedi, Mahdi Azizian
- Abstract summary: Medical device manufacturers are keen to maximize the advantages afforded by AI and ML by consolidating multiple applications onto a single platform. However, concurrent execution of several AI applications, each with its own visualization components, leads to unpredictable end-to-end latency. This paper addresses these challenges within the context of NVIDIA's Holoscan platform, a real-time AI system for streaming sensor data and images.
- Score: 0.35516599670943777
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The introduction of AI and ML technologies into medical devices has
revolutionized healthcare diagnostics and treatments. Medical device
manufacturers are keen to maximize the advantages afforded by AI and ML by
consolidating multiple applications onto a single platform. However, concurrent
execution of several AI applications, each with its own visualization
components, leads to unpredictable end-to-end latency, primarily due to GPU
resource contentions. To mitigate this, manufacturers typically deploy separate
workstations for distinct AI applications, thereby increasing financial,
energy, and maintenance costs. This paper addresses these challenges within the
context of NVIDIA's Holoscan platform, a real-time AI system for streaming
sensor data and images. We propose a system design optimized for heterogeneous
GPU workloads, encompassing both compute and graphics tasks. Our design
leverages CUDA MPS for spatial partitioning of compute workloads and isolates
compute and graphics processing onto separate GPUs. We demonstrate significant
performance improvements across various end-to-end latency determinism metrics
through empirical evaluation with real-world Holoscan medical device
applications. For instance, the proposed design reduces maximum latency by
21-30% and improves latency distribution flatness by 17-25% for up to five
concurrent endoscopy tool tracking AI applications, compared to a single-GPU
baseline. Against a default multi-GPU setup, our optimizations decrease maximum
latency by 35% for up to six concurrent applications by improving GPU
utilization by 42%. This paper provides clear design insights for AI
applications in the edge-computing domain including medical systems, where
performance predictability of concurrent and heterogeneous GPU workloads is a
critical requirement.
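The abstract's two key design levers are spatial partitioning of concurrent compute workloads via CUDA MPS and isolating compute and graphics processing onto separate GPUs. As a minimal sketch of how a launcher script might apply these levers, the Python snippet below caps each MPS client's SM share and keeps CUDA work off the rendering GPU; the binary name, the render-GPU variable, and the GPU indices are illustrative assumptions, not Holoscan APIs.

```python
import os
import subprocess

# Illustrative launcher, assuming the MPS control daemon is already running
# on the compute GPU. "my_holoscan_app" is a hypothetical binary standing in
# for a Holoscan pipeline; the GPU indices are assumptions.
NUM_APPS = 5
COMPUTE_GPU = "0"    # CUDA compute workloads, shared via MPS
GRAPHICS_GPU = "1"   # reserved for visualization/rendering

procs = []
for i in range(NUM_APPS):
    env = os.environ.copy()
    # Documented MPS knob for spatial partitioning: cap each client's
    # share of SMs so concurrent apps do not contend for the whole GPU.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(100 // NUM_APPS)
    # Keep CUDA compute work off the graphics GPU.
    env["CUDA_VISIBLE_DEVICES"] = COMPUTE_GPU
    # Hint for the app's renderer to pick the dedicated graphics GPU
    # (how a renderer selects its device is app-specific; this variable
    # name is a placeholder, not a Holoscan API).
    env["APP_RENDER_GPU"] = GRAPHICS_GPU
    procs.append(subprocess.Popen(["my_holoscan_app", "--instance", str(i)], env=env))

for p in procs:
    p.wait()
```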
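The evaluation is framed around latency determinism metrics such as maximum latency and latency distribution flatness. The paper's exact flatness definition is not reproduced in this summary; as a hedged stand-in, the snippet below reports worst-case latency together with a simple percentile spread, where a smaller spread indicates a flatter (more deterministic) distribution.

```python
import statistics

def determinism_metrics(latencies_ms):
    """Summarize an end-to-end latency sample (values in milliseconds)."""
    lat = sorted(latencies_ms)
    n = len(lat)
    p50 = lat[n // 2]
    p99 = lat[min(n - 1, (99 * n) // 100)]
    return {
        "max_ms": lat[-1],                  # worst-case latency
        "p99_minus_p50_ms": p99 - p50,      # smaller spread => flatter distribution
        "stdev_ms": statistics.stdev(lat),  # overall jitter
    }

# Example with made-up per-frame latencies:
print(determinism_metrics([33.1, 34.0, 33.5, 47.9, 33.8, 35.2, 60.3, 34.1]))
```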
Related papers
- PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers [3.0518650058744075]
PREBA is a hardware/software co-design targeting MIG inference servers.
It delivers 3.7x higher throughput, a 3.4x reduction in tail latency, a 3.5x improvement in energy efficiency, and a 3.0x improvement in cost efficiency.
arXiv Detail & Related papers (2024-11-28T13:02:41Z)
- Benchmarking Edge AI Platforms for High-Performance ML Inference [0.0]
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions.
While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads can vary significantly.
We compare the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions.
arXiv Detail & Related papers (2024-09-23T08:27:27Z)
- Multi-GPU RI-HF Energies and Analytic Gradients - Towards High Throughput Ab Initio Molecular Dynamics [0.0]
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs).
The algorithm is especially designed for high-throughput ab initio molecular dynamics simulations of small and medium-sized molecules (10-100 atoms).
arXiv Detail & Related papers (2024-07-29T00:14:10Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Benchmarking Edge Computing Devices for Grape Bunches and Trunks Detection using Accelerated Object Detection Single Shot MultiBox Deep Learning Models [2.1922186455344796]
This work benchmarks the performance of different platforms for object detection in real-time.
The authors used a RetinaNet ResNet-50 model fine-tuned on the natural Vine dataset.
arXiv Detail & Related papers (2022-11-21T17:02:33Z)
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z)
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X, which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
- Multi-Component Optimization and Efficient Deployment of Neural-Networks on Resource-Constrained IoT Hardware [4.6095200019189475]
We present an end-to-end multi-component model optimization sequence and open-source its implementation.
Our optimization components can produce models that are: (i) 12.06x compressed; (ii) 0.13% to 0.27% more accurate; and (iii) orders of magnitude faster, with unit inference taking just 0.06 ms.
arXiv Detail & Related papers (2022-04-20T13:30:04Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
Compared to its full-precision software counterpart, it reduces classification time by three orders of magnitude, with a small 4.5% impact on accuracy.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- MAPLE: Microprocessor A Priori for Latency Estimation [81.91509153539566]
Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption.
Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process.
We propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on transfer learning or domain adaptation.
arXiv Detail & Related papers (2021-11-30T03:52:15Z)