Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference
- URL: http://arxiv.org/abs/2501.14794v2
- Date: Sat, 04 Oct 2025 03:01:51 GMT
- Title: Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference
- Authors: Le Chen, Dahu Feng, Erhu Feng, Yingrui Wang, Rong Zhao, Yubin Xia, Pinjie Xu, Haibo Chen,
- Abstract summary: HeteroInfer is the fastest LLM inference engine in mobile devices which supports GPU-NPU heterogeneous execution.<n> Evaluation shows that HeteroInfer delivers a 1.34x to 6.02x end-to-end speedup over state-of-the-art GPU-NPU engines.
- Score: 11.755602920122803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of artificial intelligence technologies such as ChatGPT, AI agents, and video generation, contemporary mobile systems have begun integrating these AI capabilities on local devices to enhance privacy and reduce response latency. To meet the computational demands of AI tasks, current mobile SoCs are equipped with diverse AI accelerators, including GPUs and Neural Processing Units (NPUs). However, there has not been a comprehensive characterization of these heterogeneous processors, and existing designs typically only leverage a single AI accelerator for LLM inference, leading to suboptimal use of computational resources and memory bandwidth. In this paper, we first summarize key performance characteristics of heterogeneous processors, SoC memory bandwidth, etc. Drawing on these observations, we propose different heterogeneous parallel mechanisms to fully exploit both GPU and NPU computational power and memory bandwidth. We further design a fast synchronization mechanism between heterogeneous processors that leverages the unified memory architecture. By employing these techniques, we present HeteroInfer, the fastest LLM inference engine in mobile devices which supports GPU-NPU heterogeneous execution. Evaluation shows that HeteroInfer delivers a 1.34x to 6.02x end-to-end speedup over state-of-the-art GPU-only and NPU-only LLM engines, while maintaining negligible interference with other applications.
Related papers
- Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution [1.3356260369011272]
We propose a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models to accurately predict execution times.<n>A comprehensive evaluation on four mobile platforms shows that our approach can quickly select CPU- GPU co-execution strategies achieving up to 1.89x speedup for linear layers and 1.75x speedup for convolutional layers.
arXiv Detail & Related papers (2025-10-24T01:41:43Z) - Scaling On-Device GPU Inference for Large Generative Models [5.938112995772544]
ML Drift is an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines.<n>Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.
arXiv Detail & Related papers (2025-05-01T00:44:13Z) - Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference [31.901686946969786]
Dovetail is an inference method that leverages the complementary characteristics of heterogeneous devices and the advantages of speculative decoding.<n>Dovetail achieves inference speedups ranging from 1.79x to 10.1x across different devices, while maintaining consistency and stability in the distribution of generated texts.
arXiv Detail & Related papers (2024-12-25T15:45:18Z) - MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices [24.1144641404561]
We propose a scheme for exact attention inference acceleration on memory-constrained edge accelerators.<n>We show up to 2.75x speedup and 54% reduction in energy consumption as compared to the state-of-the-art attention fusion method (FLAT) in the edge computing scenario.
arXiv Detail & Related papers (2024-11-20T19:44:26Z) - Benchmarking Edge AI Platforms for High-Performance ML Inference [0.0]
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions.
While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads can vary significantly.
We compare the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions.
arXiv Detail & Related papers (2024-09-23T08:27:27Z) - Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI.
As the number of model parameters reaches to hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z) - Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference [2.9302211589186244]
Large language models (LLMs) have transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations.
Developments in computing and memory capabilities are lagging behind, exacerbated by the discontinuation of Moore's law.
compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by directly performing analog computations in memory.
arXiv Detail & Related papers (2024-06-12T16:57:58Z) - Deep Learning Inference on Heterogeneous Mobile Processors: Potentials and Pitfalls [22.49750818224266]
A growing demand to deploy computation-intensive deep learning (DL) models on resource-constrained mobile devices for real-time intelligent applications.
Mobile devices hold potential to accelerate DL inference via parallel execution across heterogeneous processors.
This paper presents a holistic empirical study to assess the capabilities and challenges associated with parallel DL inference on heterogeneous mobile processors.
arXiv Detail & Related papers (2024-05-03T04:47:23Z) - Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation [15.35494431928751]
Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving.
We introduce model-attention disaggregation to enhance the efficiency of LLM decoding.
We develop and deploy Lamina, an LLM inference system that incorporates model-attention disaggregation in a distributed heterogeneous cluster.
arXiv Detail & Related papers (2024-05-03T02:15:15Z) - Efficient and accurate neural field reconstruction using resistive memory [52.68088466453264]
Traditional signal reconstruction methods on digital computers face both software and hardware challenges.
We propose a systematic approach with software-hardware co-optimizations for signal reconstruction from sparse inputs.
This work advances the AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
arXiv Detail & Related papers (2024-04-15T09:33:09Z) - Resistive Memory-based Neural Differential Equation Solver for Score-based Diffusion Model [55.116403765330084]
Current AIGC methods, such as score-based diffusion, are still deficient in terms of rapidity and efficiency.
We propose a time-continuous and analog in-memory neural differential equation solver for score-based diffusion.
We experimentally validate our solution with 180 nm resistive memory in-memory computing macros.
arXiv Detail & Related papers (2024-04-08T16:34:35Z) - MELTing point: Mobile Evaluation of Language Transformers [8.238355633015068]
We explore the current state of mobile execution of Large Language Models (LLMs)
We have created our own automation infrastructure, MELT, which supports the headless execution and benchmarking of LLMs on device.
We evaluate popular instruction fine-tuned LLMs and leverage different frameworks to measure their end-to-end and granular performance.
arXiv Detail & Related papers (2024-03-19T15:51:21Z) - Using the Abstract Computer Architecture Description Language to Model
AI Hardware Accelerators [77.89070422157178]
Manufacturers of AI-integrated products face a critical challenge: selecting an accelerator that aligns with their product's performance requirements.
The Abstract Computer Architecture Description Language (ACADL) is a concise formalization of computer architecture block diagrams.
In this paper, we demonstrate how to use the ACADL to model AI hardware accelerators, use their ACADL description to map DNNs onto them, and explain the timing simulation semantics to gather performance results.
arXiv Detail & Related papers (2024-01-30T19:27:16Z) - Machine Learning Insides OptVerse AI Solver: Design Principles and
Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problem.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z) - Asynchronous Parallel Incremental Block-Coordinate Descent for
Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z) - How to Reach Real-Time AI on Consumer Devices? Solutions for
Programmable and Custom Architectures [7.085772863979686]
Deep neural networks (DNNs) have led to large strides in various Artificial Intelligence (AI) inference tasks, such as object and speech recognition.
deploying such AI models across commodity devices faces significant challenges.
We present techniques for achieving real-time performance following a cross-stack approach.
arXiv Detail & Related papers (2021-06-21T11:23:12Z) - Reconfigurable Intelligent Surface Assisted Mobile Edge Computing with
Heterogeneous Learning Tasks [53.1636151439562]
Mobile edge computing (MEC) provides a natural platform for AI applications.
We present an infrastructure to perform machine learning tasks at an MEC with the assistance of a reconfigurable intelligent surface (RIS)
Specifically, we minimize the learning error of all participating users by jointly optimizing transmit power of mobile users, beamforming vectors of the base station, and the phase-shift matrix of the RIS.
arXiv Detail & Related papers (2020-12-25T07:08:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.