Subgraph Stationary Hardware-Software Inference Co-Design
- URL: http://arxiv.org/abs/2306.17266v1
- Date: Wed, 21 Jun 2023 16:02:52 GMT
- Title: Subgraph Stationary Hardware-Software Inference Co-Design
- Authors: Payman Behnam, Jianming Tong, Alind Khare, Yangyu Chen, Yue Pan,
Pranav Gadikar, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexey
Tumanov
- Abstract summary: A growing body of research focuses on reaching better latency-accuracy tradeoffs for Machine Learning models.
We make a case for applications that operate in dynamically changing deployment scenarios, where no single static point is optimal.
- We take a hardware-software co-design approach with a real implementation of SGS in SushiAccel and a software scheduler, SushiSched, which controls which SubNets to serve and what to cache in real time.
- Score: 11.17417275752636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A growing number of applications depend on Machine Learning (ML)
functionality and benefit from both higher-quality ML predictions and better
timeliness (latency) at the same time. A growing body of research in computer
architecture, ML, and systems software literature focuses on reaching better
latency-accuracy tradeoffs for ML models. Efforts include compression,
quantization, pruning, early-exit models, mixed DNN precision, as well as ML
inference accelerator designs that minimize latency and energy, while
preserving delivered accuracy. All of them, however, yield improvements for a
single static point in the latency-accuracy tradeoff space. We make a case for
applications that operate in dynamically changing deployment scenarios, where
no single static point is optimal. We draw on a recently proposed weight-shared
SuperNet mechanism to enable serving a stream of queries that uses (activates)
different SubNets within this weight-shared construct. This creates an
opportunity to exploit the inherent temporal locality with our proposed
SubGraph Stationary (SGS) optimization. We take a hardware-software co-design
approach with a real implementation of SGS in SushiAccel and the implementation
of a software scheduler, SushiSched, which controls which SubNets to serve and
what to cache in real time. Combined, they are vertically integrated into
SUSHI, an inference serving stack. For a stream of queries, SUSHI yields up to
a 25% improvement in latency and a 0.98% increase in served accuracy, while
achieving up to 78.7% off-chip energy savings.
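The SGS intuition is concrete enough to sketch: consecutive queries tend to activate overlapping SubNets of the weight-shared SuperNet, so the weights of frequently reused subgraphs can stay resident on the accelerator. Below is a minimal, hypothetical Python sketch of a SushiSched-style policy; the class, its fields, and all numbers are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

class SushiSchedSketch:
    """Hypothetical sketch of an SGS-style policy (not SushiSched's code):
    serve the most accurate SubNet that fits each query's latency budget,
    and keep the most frequently reused subgraph weights pinned on-chip."""

    def __init__(self, subnets, cache_slots):
        # subnets: list of (name, est_latency_ms, accuracy, layer_ids)
        self.subnets = sorted(subnets, key=lambda s: -s[2])  # accuracy-first
        self.cache_slots = cache_slots   # on-chip capacity, in layers
        self.cached = set()              # layer ids currently held on-chip
        self.reuse = Counter()           # temporal-locality statistics

    def schedule(self, latency_budget_ms):
        for name, lat, acc, layers in self.subnets:
            if lat <= latency_budget_ms:
                hits = sum(l in self.cached for l in layers)
                self.reuse.update(layers)
                # Keep the hottest subgraph stationary on-chip (the SGS idea).
                self.cached = {l for l, _ in
                               self.reuse.most_common(self.cache_slots)}
                return name, hits / len(layers)  # hit rate ~ traffic saved
        return None, 0.0

# Toy stream: overlapping SubNets make later queries hit the cache.
sched = SushiSchedSketch(
    [("L", 40.0, 0.82, ["b1", "b2", "b3", "b4"]),
     ("M", 20.0, 0.78, ["b1", "b2", "b3"]),
     ("S", 8.0, 0.71, ["b1", "b2"])],
    cache_slots=3)
for budget in (25.0, 25.0, 10.0):
    print(sched.schedule(budget))  # ('M', 0.0), ('M', 1.0), ('S', 1.0)
```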
Related papers
- Quamba: A Post-Training Quantization Recipe for Selective State Space Models [8.924779222965798]
State Space Models (SSMs) have emerged as an appealing alternative to Transformers for large language models.
We propose a static 8-bit per-tensor SSM quantization method which suppresses the maximum values of the input activations to the selective SSM.
Our 8-bit weight-activation quantized Mamba 2.8B SSM benefits from hardware acceleration and achieves a 1.72x lower generation latency on an Nvidia Orin Nano 8G, with only a 0.9% drop in average accuracy on zero-shot tasks.
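Quamba's exact recipe is in the paper; as a generic illustration of static per-tensor 8-bit quantization with suppression of the largest activation values (percentile clipping below is an assumed stand-in for the paper's method):

```python
import numpy as np

def calibrate_scale(calib_activations, clip_percentile=99.9):
    # Per-tensor scale from calibration data; clipping the largest values
    # keeps rare activation spikes from inflating the int8 step size.
    amax = np.percentile(np.abs(calib_activations), clip_percentile)
    return amax / 127.0  # symmetric int8 range [-127, 127]

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# "Static" means the scale is fixed offline, so nothing is recalibrated
# per token at inference time.
calib = np.random.randn(10_000) * 4.0
scale = calibrate_scale(calib)
x = np.random.randn(8)
x_hat = quantize_int8(x, scale).astype(np.float32) * scale  # dequantized
print(np.abs(x - x_hat).max())  # error of unclipped values ~ scale / 2
```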
arXiv Detail & Related papers (2024-10-17T05:32:33Z)
- Co-designing a Sub-millisecond Latency Event-based Eye Tracking System with Submanifold Sparse CNN [8.613703056677457]
Eye-tracking technology is integral to numerous consumer electronics applications, particularly in virtual and augmented reality (VR/AR)
Yet, achieving optimal performance across all these fronts presents a formidable challenge.
We tackle this challenge through a synergistic software/hardware co-design of the system with an event camera.
Our system achieves 81% p5 accuracy, 99.5% p10 accuracy, and 3.71 Mean Euclidean Distance with 0.7 ms latency, while consuming only 2.29 mJ per inference.
arXiv Detail & Related papers (2024-04-22T15:28:42Z)
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads [18.461201610784077]
ML inference serving systems need to balance an application's latency and accuracy requirements.
We show that SubNetAct simultaneously serves the entire range of models spanning the latency-accuracy tradeoff space.
We show that SubNetAct requires up to 2.6x less memory to serve a vastly higher number of models than the prior state of the art.
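The memory saving follows from weight sharing: every SubNet is a slice of one SuperNet tensor, so serving many models costs roughly one model's weights. A toy sketch of the idea (a hypothetical class, not SubNetAct's API):

```python
import numpy as np

class SharedLinear:
    """Toy weight-shared layer: one full-width matrix backs every SubNet,
    so activating a narrower SubNet is a slice (a view), not a copy --
    which is why many models fit in roughly one model's memory."""
    def __init__(self, max_in, max_out):
        self.W = np.random.randn(max_out, max_in).astype(np.float32)

    def forward(self, x, out_dim):
        in_dim = x.shape[-1]
        # Slice the shared SuperNet weights down to the requested SubNet.
        return x @ self.W[:out_dim, :in_dim].T

layer = SharedLinear(max_in=512, max_out=512)
small = layer.forward(np.ones((1, 128), np.float32), out_dim=128)  # SubNet
large = layer.forward(np.ones((1, 512), np.float32), out_dim=512)  # SuperNet
print(small.shape, large.shape)  # (1, 128) (1, 512)
```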
arXiv Detail & Related papers (2023-12-27T22:24:11Z)
- Low-Latency ML Inference by Grouping Correlated Data Objects and Computation [0.20482269513546453]
We propose a novel correlation grouping mechanism that makes it easier for developers to express application-specific data access correlations.
Experiments based on a latency-sensitive ML-based application confirm the limitations of standard techniques.
The proposed mechanism maintains significantly lower and more consistent latency and higher node utilization as workload and scale-out increase.
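One way to picture such a mechanism (this sketch is an assumption for illustration, not the paper's API): route objects by a developer-declared group id instead of the raw key, so correlated objects co-locate on one node and a request touching them all avoids cross-node hops.

```python
import hashlib

def node_for(key, groups, num_nodes):
    # Route by the declared group id when one exists, else by the raw key;
    # correlated objects then hash to the same node.
    routing_key = groups.get(key, key)
    digest = hashlib.sha256(routing_key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Hypothetical developer-declared correlation: these objects are read together.
groups = {"user:42:profile": "user:42", "user:42:cart": "user:42"}
assert node_for("user:42:profile", groups, 8) == node_for("user:42:cart", groups, 8)
```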
arXiv Detail & Related papers (2023-11-30T16:02:04Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
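As a rough, generic stand-in for estimating per-operator accuracy impact (not the paper's Mask-Guided Quantization Estimation): fake-quantize one layer at a time on calibration data and rank layers by how far the output drifts.

```python
import numpy as np

def fake_quant(w, bits):
    # Symmetric uniform fake-quantization of one weight matrix.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def run(layers, x):
    for w in layers:
        x = np.maximum(x @ w, 0.0)  # linear + ReLU
    return x

def layer_sensitivity(layers, calib_x, bits=4):
    # Quantize one layer at a time and measure output drift; high-drift
    # layers should keep higher precision in the mixed-precision plan.
    ref = run(layers, calib_x)
    scores = {}
    for i, w in enumerate(layers):
        trial = layers[:i] + [fake_quant(w, bits)] + layers[i + 1:]
        scores[i] = float(np.mean((run(trial, calib_x) - ref) ** 2))
    return scores

rng = np.random.default_rng(0)
layers = [rng.normal(size=(16, 16)) for _ in range(4)]
print(layer_sensitivity(layers, rng.normal(size=(32, 16))))
```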
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms: spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
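Dynamic layer skipping only pays off if skipped work is truly elided at runtime; a toy residual block with an input-dependent gate shows the control flow (illustrative, not LAUDNet's gating mechanism):

```python
import numpy as np

def gate(x, threshold):
    # Cheap input-dependent decision: pooled activation magnitude.
    return float(np.mean(np.abs(x))) > threshold

def dynamic_block(x, w, threshold=0.5):
    # Residual block whose body runs only when the gate fires; a skipped
    # block costs one pooling pass, which is where the latency savings
    # come from (provided the runtime truly elides the skipped kernels).
    if not gate(x, threshold):
        return x                          # identity shortcut, block skipped
    return x + np.maximum(x @ w, 0.0)     # residual branch executed

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)) * 0.1
print(dynamic_block(rng.normal(size=8), w).shape)              # executed
print(np.array_equal(dynamic_block(np.zeros(8), w), np.zeros(8)))  # True: skipped
```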
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
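The underlying recipe, featurizing a device with a handful of performance counters and regressing measured latency, can be sketched with synthetic data (the counters, model choice, and numbers below are assumptions for illustration, not MAPLE-Edge's pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a MAPLE-style dataset: each row holds a few CPU
# performance counters plus architecture descriptors; the label is the
# latency measured on the target device.
rng = np.random.default_rng(0)
X = rng.random((200, 6))  # e.g. [instr, cache_misses, branches, depth, width, flops]
y = X @ np.array([5.0, 3.0, 1.0, 0.5, 2.0, 4.0]) + rng.normal(0.0, 0.1, 200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:150], y[:150])
print("predicted latency (ms):", float(model.predict(X[150:151])[0]))
```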
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which jointly generates the exit point and compressing bits via soft policy iterations.
Based on a latency- and accuracy-aware reward design, the framework adapts well to complex environments such as dynamic wireless channels and arbitrary processing loads, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Joint Channel and Weight Pruning for Model Acceleration on Mobile Devices [37.51092726022731]
Pruning is a widely adopted practice to balance computational resource consumption and accuracy.
We present a unified framework with Joint Channel pruning and Weight pruning (JCW).
We develop a tailored multi-objective evolutionary algorithm in the JCW framework, which enables one single search to obtain the optimal candidate architectures.
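The selection step of such a multi-objective search can be pictured as Pareto filtering over (latency, error) candidates; this sketch is a generic illustration, not JCW's evolutionary algorithm:

```python
def pareto_front(candidates):
    # Keep the (latency, error) points no other candidate dominates;
    # candidates is a list of (latency_ms, error, config) tuples.
    front = []
    for lat, err, cfg in candidates:
        dominated = any(l <= lat and e <= err and (l, e) != (lat, err)
                        for l, e, _ in candidates)
        if not dominated:
            front.append((lat, err, cfg))
    return front

# One search produces many candidates; the front is the menu of optimal
# latency-accuracy tradeoff points to deploy from.
cands = [(10, 0.30, "A"), (12, 0.25, "B"), (15, 0.25, "C"), (9, 0.40, "D")]
print(pareto_front(cands))  # C is dominated by B; A, B, D survive
```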
arXiv Detail & Related papers (2021-10-15T11:18:42Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.