PowerInfer-2: Fast Large Language Model Inference on a Smartphone
- URL: http://arxiv.org/abs/2406.06282v3
- Date: Thu, 12 Dec 2024 12:24:18 GMT
- Title: PowerInfer-2: Fast Large Language Model Inference on a Smartphone
- Authors: Zhenliang Xue, Yixin Song, Zeyu Mi, Xinrui Zheng, Yubin Xia, Haibo Chen,
- Abstract summary: Large language models (LLMs) on smartphones enable real-time AI assistance and privacy-preserving, offline operation.
This paper introduces PowerInfer-2, a smartphone-based framework that enables fast inference for LLMs exceeding the memory capacity.
PowerInfer-2 is the first system to serve a 47B LLM on a smartphone, achieving 11.68 tokens/s.
- Score: 4.75185107146461
- License:
- Abstract: Large language models (LLMs) on smartphones enable real-time AI assistance and privacy-preserving, offline operation. However, resource constraints of smartphones limit current deployments to small language models (SLMs), significantly compromising their capabilities. This paper introduces PowerInfer-2, a smartphone-based framework that enables fast inference for LLMs exceeding the memory capacity. The key insight is decomposing matrix operations into neuron clusters as the basic processing unit, which enables flexible scheduling and efficient I/O-computation pipelining. PowerInfer-2 leverages this neuron-cluster-based design in both computation and storage. For computation, neuron clusters with dense activations are processed on NPU, while sparse clusters use CPU. The storage engine provides a fine-grained pipeline mechanism that coordinates cluster-level computation and I/O operations, enhanced by a segmented neuron cache to reduce I/O activities. PowerInfer-2 achieves up to a 27.8x speed increase compared to state-of-the-art frameworks. PowerInfer-2 is the first system to serve a 47B LLM on a smartphone, achieving 11.68 tokens/s. Notably, these performance improvements preserve model quality with negligible accuracy degradation.
Related papers
- Benchmarking Edge AI Platforms for High-Performance ML Inference [0.0]
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions.
While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads can vary significantly.
We compare the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions.
arXiv Detail & Related papers (2024-09-23T08:27:27Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batchsizes up to 16-32 can be supported with close to maximum ($4times$) quantization speedup.
MarLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference [2.9302211589186244]
Large language models (LLMs) have transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations.
Developments in computing and memory capabilities are lagging behind, exacerbated by the discontinuation of Moore's law.
compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by directly performing analog computations in memory.
arXiv Detail & Related papers (2024-06-12T16:57:58Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Distributed Inference and Fine-tuning of Large Language Models Over The
Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z) - Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative
Model Inference with Unstructured Sparsity [12.663030430488922]
We propose Flash-LLM for enabling low-cost and highly-efficient large generative model inference on high-performance Cores.
At SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art library, i.e., Sputnik and SparTA by an average of 2.9x and 1.5x, respectively.
arXiv Detail & Related papers (2023-09-19T03:20:02Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - A Heterogeneous In-Memory Computing Cluster For Flexible End-to-End
Inference of Real-World Deep Neural Networks [12.361842554233558]
Deployment of modern TinyML tasks on small battery-constrained IoT devices requires high computational energy efficiency.
Analog In-Memory Computing (IMC) using non-volatile memory (NVM) promises major efficiency improvements in deep neural network (DNN) inference.
We present a heterogeneous tightly-coupled architecture integrating 8 RISC-V cores, an in-memory computing accelerator (IMA), and digital accelerators.
arXiv Detail & Related papers (2022-01-04T11:12:01Z) - POEM: 1-bit Point-wise Operations based on Expectation-Maximization for
Efficient Point Cloud Processing [53.74076015905961]
We introduce point-wise operations based on Expectation-Maximization into BNNs for efficient point cloud processing.
Our POEM surpasses existing the state-of-the-art binary point cloud networks by a significant margin, up to 6.7 %.
arXiv Detail & Related papers (2021-11-26T09:45:01Z) - SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and
Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z) - Efficient Neural Network Deployment for Microcontroller [0.0]
This paper is going to explore and generalize convolution neural network deployment for microcontrollers.
The memory savings and performance will be compared with CMSIS-NN framework developed for ARM Cortex-M CPUs.
The final purpose is to develop a tool consuming PyTorch model with trained network weights, and it turns into an optimized inference engine in C/C++ for low memory(kilobyte level) and limited computing capable microcontrollers.
arXiv Detail & Related papers (2020-07-02T19:21:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.