Related papers: Scaling On-Device GPU Inference for Large Generative Models

Scaling On-Device GPU Inference for Large Generative Models

URL: http://arxiv.org/abs/2505.00232v1
Date: Thu, 01 May 2025 00:44:13 GMT
Title: Scaling On-Device GPU Inference for Large Generative Models
Authors: Jiuqiang Tang, Raman Sarokin, Ekaterina Ignasheva, Grant Jensen, Lin Chen, Juhyun Lee, Andrei Kulik, Matthias Grundmann,
Abstract summary: ML Drift is an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines.<n>Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.
Score: 5.938112995772544
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Driven by the advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the imperative for on-device inference, necessitated by privacy and efficiency considerations, persists. Recognizing GPUs as the on-device ML accelerator with the widest reach, we present ML Drift--an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines. ML Drift enables on-device execution of generative AI workloads which contain 10 to 100x more parameters than existing on-device generative AI models. ML Drift addresses intricate engineering challenges associated with cross-GPU API development, and ensures broad compatibility across mobile and desktop/laptop platforms, thereby facilitating the deployment of significantly more complex models on resource-constrained devices. Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.

Related papers

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints [7.287566040274871]
MoE-Lens is an inference system designed through holistic performance modeling for resource-constrained environments.<n>It captures the system execution mechanisms to identify the key hardware bottlenecks and accurately predict the achievable throughput.<n> evaluated on diverse MoE models and datasets, MoE-Lens outperforms the state-of-the-art solution by 4.6x on average (up to 25.5x)
arXiv Detail & Related papers (2025-04-12T21:26:56Z)
HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs platform with Heterogeneous AI Accelerators [7.377592753635839]
HeteroLLM is the fastest LLM inference engine in mobile devices which supports both layer-level and tensor-level heterogeneous execution.<n> Evaluation results show that HeteroLLM achieves 9.99 and 4.36 performance improvement over other mobile-side LLM inference engines.
arXiv Detail & Related papers (2025-01-11T02:42:02Z)
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM. DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
FIRE-3DV: Framework-Independent Rendering Engine for 3D Graphics using Vulkan [4.226502078427161]
This paper presents a performance-focused and lightweight rendering engine supporting the modern Vulkan graphics API.<n>Our engine is used to modernize the legacy rendering pipeline of the Asynchronous Multi-Body Framework (AMBF), a dynamic simulation framework.<n>Experiments show that the engine can render a simulated scene with over seven million triangles while maintaining GPU computation times within two milliseconds.
arXiv Detail & Related papers (2024-10-07T14:50:19Z)
Mondrian: On-Device High-Performance Video Analytics with Compressive Packed Inference [7.624476059109304]
Mondrian is an edge system that enables high-performance object detection on high-resolution video streams. We devise a novel Compressive Packed Inference to minimize per-pixel processing costs.
arXiv Detail & Related papers (2024-03-12T12:35:12Z)
Using the Abstract Computer Architecture Description Language to Model AI Hardware Accelerators [77.89070422157178]
Manufacturers of AI-integrated products face a critical challenge: selecting an accelerator that aligns with their product's performance requirements. The Abstract Computer Architecture Description Language (ACADL) is a concise formalization of computer architecture block diagrams. In this paper, we demonstrate how to use the ACADL to model AI hardware accelerators, use their ACADL description to map DNNs onto them, and explain the timing simulation semantics to gather performance results.
arXiv Detail & Related papers (2024-01-30T19:27:16Z)
Random resistive memory-based deep extreme point learning machine for unified visual processing [67.51600474104171]
We propose a novel hardware-software co-design, random resistive memory-based deep extreme point learning machine (DEPLM) Our co-design system achieves huge energy efficiency improvements and training cost reduction when compared to conventional systems.
arXiv Detail & Related papers (2023-12-14T09:46:16Z)
SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices [48.47320494918925]
This work tackles the challenges of deploying stateof-the-art object detection models onto FPGA devices for ultralow latency applications. We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion. We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, and off-chip memory buffering to address the limited on-chip memory resources.
arXiv Detail & Related papers (2023-09-04T13:15:01Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks. We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning [7.05946599544139]
High throughput machine learning (ML) inference servers are critical for online service applications. These servers must provide a bounded latency for each request to support consistent service-level objective (SLO) This paper proposes a new ML inference scheduling framework for multi-model ML inference servers.
arXiv Detail & Related papers (2021-09-01T04:46:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.