PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms
- URL: http://arxiv.org/abs/2410.05315v1
- Date: Sat, 5 Oct 2024 03:37:07 GMT
- Title: PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms
- Authors: Yilong Li, Jingyu Liu, Hao Zhang, M Badri Narayanan, Utkarsh Sharma, Shuai Zhang, Pan Hu, Yijing Zeng, Jayaram Raghuram, Suman Banerjee
- Abstract summary: We introduce our lightweight, all-in-one automated benchmarking framework that allows users to evaluate large language models on mobile devices.
We provide a benchmark of various popular LLMs with different quantization configurations (both weights and activations) across multiple mobile platforms with varying hardware capabilities.
- Score: 11.87161637895978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying large language models (LLMs) locally on mobile devices is advantageous in scenarios where transmitting data to remote cloud servers is either undesirable due to privacy concerns or impractical due to limited network connectivity. Recent advancements (MLC, 2023a; Gerganov, 2023) have facilitated the local deployment of LLMs. However, local deployment also presents challenges, particularly in balancing quality (generative performance), latency, and throughput within the hardware constraints of mobile devices. In this paper, we introduce our lightweight, all-in-one automated benchmarking framework that allows users to evaluate LLMs on mobile devices. We provide a comprehensive benchmark of various popular LLMs with different quantization configurations (both weights and activations) across multiple mobile platforms with varying hardware capabilities. Unlike traditional benchmarks that assess full-scale models on high-end GPU clusters, we focus on evaluating resource efficiency (memory and power consumption) and harmful output for compressed models on mobile devices. Our key observations include i) differences in energy efficiency and throughput across mobile platforms; ii) the impact of quantization on memory usage, GPU execution time, and power consumption; iii) accuracy and performance degradation of quantized models compared to their non-quantized counterparts; and iv) the frequency of hallucinations and toxic content generated by compressed LLMs on mobile devices.
Related papers
- Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation [10.817783356090027]
Large language models (LLMs) increasingly integrate into every aspect of our work and daily lives.
There are growing concerns about user privacy, which push the trend toward local deployment of these models.
Because local LLM deployment is a rapidly emerging application, we are concerned about its performance on commercial off-the-shelf mobile devices.
arXiv Detail & Related papers (2024-10-04T17:14:59Z)
- MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases [81.70591346986582]
We introduce MobileAIBench, a benchmarking framework for evaluating Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices.
MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real devices.
arXiv Detail & Related papers (2024-06-12T22:58:12Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate the computational and memory demands of large language models by compressing and accelerating them.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- MELTing point: Mobile Evaluation of Language Transformers [8.238355633015068]
We explore the current state of mobile execution of Large Language Models (LLMs).
We have created our own automation infrastructure, MELT, which supports the headless execution and benchmarking of LLMs on device.
We evaluate popular instruction fine-tuned LLMs and leverage different frameworks to measure their end-to-end and granular performance.
arXiv Detail & Related papers (2024-03-19T15:51:21Z)
- A Performance Evaluation of a Quantized Large Language Model on Various Smartphones [0.0]
This paper explores the feasibility and performance of on-device large language model (LLM) inference on various Apple iPhone models.
Leveraging existing literature on running multi-billion parameter LLMs on resource-limited devices, our study examines the thermal effects and interaction speeds of a high-performing LLM.
We present real-world performance results, providing insights into on-device inference capabilities.
arXiv Detail & Related papers (2023-12-19T10:19:39Z)
- Confidant: Customizing Transformer-based LLMs via Collaborative Edge Training [18.526329975259483]
Transformer-based large language models (LLMs) have demonstrated impressive capabilities in a variety of natural language processing (NLP) tasks.
It is challenging to deploy and fine-tune LLMs on mobile edge devices with limited computing, memory, and energy budgets.
We propose Confidant, a multi-backend collaborative training framework for customizing state-of-the-art LLMs on commodity mobile devices.
arXiv Detail & Related papers (2023-11-22T13:20:59Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [54.692405042065815]
We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization.
AWQ protects only the 1% of weights that are salient and achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs.
We also implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs.
arXiv Detail & Related papers (2023-06-01T17:59:10Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- MetaNetwork: A Task-agnostic Network Parameters Generation Framework for Improving Device Model Generalization [65.02542875281233]
We propose a novel task-agnostic framework, named MetaNetwork, for generating adaptive device model parameters from cloud without on-device training.
The MetaGenerator is designed to learn a mapping function from samples to model parameters, and it can generate and deliver the adaptive parameters to the device based on samples uploaded from the device to the cloud.
The MetaStabilizer aims to reduce the oscillation of the MetaGenerator, accelerate the convergence and improve the model performance during both training and inference.
arXiv Detail & Related papers (2022-09-12T13:26:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.