Related papers: Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency

URL: http://arxiv.org/abs/2507.02135v1
Date: Wed, 02 Jul 2025 20:47:40 GMT
Title: Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency
Authors: Zongpu Zhang, Pranab Dash, Y. Charlie Hu, Qiang Xu, Jian Li, Haibing Guan
Abstract summary: Large Language Models (LLMs) are increasingly being integrated into various applications and services running on billions of mobile devices. Currently, deploying LLMs on resource-limited mobile devices faces a significant challenge due to their high demand for computation, memory, and ultimately energy.
Score: 20.904706759529237
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are increasingly being integrated into various applications and services running on billions of mobile devices. However, deploying LLMs on resource-limited mobile devices faces a significant challenge due to their high demand for computation, memory, and ultimately energy. While current LLM frameworks for mobile use three power-hungry components (CPU, GPU, and memory) even when running primarily-GPU LLM models, the optimized DVFS governors for CPU, GPU, and memory featured in modern mobile devices operate independently and are oblivious of each other. Motivated by this observation, in this work we first measure the energy efficiency of a SOTA LLM framework with various LLM models on mobile phones, which shows that the triplet of mobile governors results in up to 40.4% longer prefilling and decoding latency compared to optimal combinations of CPU, GPU, and memory frequencies with the same energy consumption for sampled prefill and decode lengths. Second, we conduct an in-depth measurement study to uncover how the intricate interplay (or lack thereof) among the mobile governors causes the above inefficiency in LLM inference. Finally, based on these insights, we design FUSE, a unified energy-aware governor for optimizing the energy efficiency of LLM inference on mobile devices. Our evaluation using a ShareGPT dataset shows that FUSE reduces the time-to-first-token and time-per-output-token latencies by 7.0%-16.9% and 25.4%-36.8% on average, respectively, with the same energy-per-token for various mobile LLM models.

Related papers

EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
Are We There Yet? A Measurement Study of Efficiency for LLM Applications on Mobile Devices [5.926813659185372]
Only small-size large language models (LLMs) can run successfully on powerful mobile devices, though they exhibit quality limitations compared to larger models.
arXiv Detail & Related papers (2025-03-10T16:27:17Z)
WDMoE: Wireless Distributed Mixture of Experts for Large Language Models [68.45482959423323]
Large Language Models (LLMs) have achieved significant success in various natural language processing tasks. We propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks.
arXiv Detail & Related papers (2024-11-11T02:48:00Z)
PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms [11.87161637895978]
We introduce our lightweight, all-in-one automated benchmarking framework that allows users to evaluate large language models on mobile devices. We provide a benchmark of various popular LLMs with different quantization configurations (both weights and activations) across multiple mobile platforms with varying hardware capabilities.
arXiv Detail & Related papers (2024-10-05T03:37:07Z)
Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices [10.817783356090027]
Large language models (LLMs) increasingly integrate into every aspect of our work and daily lives. There are growing concerns about user privacy, which push the trend toward local deployment of these models. As a rapidly emerging application, we are concerned about their performance on commercial-off-the-shelf mobile devices.
arXiv Detail & Related papers (2024-10-04T17:14:59Z)
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases [81.70591346986582]
We introduce MobileAIBench, a benchmarking framework for evaluating Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices. MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real devices.
arXiv Detail & Related papers (2024-06-12T22:58:12Z)
MELTing point: Mobile Evaluation of Language Transformers [8.238355633015068]
We explore the current state of mobile execution of Large Language Models (LLMs). We have created our own automation infrastructure, MELT, which supports the headless execution and benchmarking of LLMs on device. We evaluate popular instruction fine-tuned LLMs and leverage different frameworks to measure their end-to-end and granular performance.
arXiv Detail & Related papers (2024-03-19T15:51:21Z)
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
Confidant: Customizing Transformer-based LLMs via Collaborative Edge Training [18.526329975259483]
Transformer-based large language models (LLMs) have demonstrated impressive capabilities in a variety of natural language processing (NLP) tasks. It is challenging to deploy and fine-tune LLMs on mobile edge devices with limited computing, memory, and energy budgets. We propose Confidant, a multi-backend collaborative training framework for customizing state-of-the-art LLMs on commodity mobile devices.
arXiv Detail & Related papers (2023-11-22T13:20:59Z)
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [54.692405042065815]
We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ protects only 1% salient weights and achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. We also implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs.
arXiv Detail & Related papers (2023-06-01T17:59:10Z)
LEAF + AIO: Edge-Assisted Energy-Aware Object Detection for Mobile Augmented Reality [77.00418462388525]
Mobile augmented reality (MAR) applications are significantly energy-guzzling. We design an edge-based energy-aware MAR system that enables MAR devices to dynamically change their configurations. Our proposed dynamic MAR configuration adaptations can minimize the per frame energy consumption of multiple MAR clients.
arXiv Detail & Related papers (2022-05-27T06:11:50Z)
To Talk or to Work: Flexible Communication Compression for Energy Efficient Federated Learning over Heterogeneous Mobile Edge Devices [78.38046945665538]
Federated learning (FL) over massive mobile edge devices opens new horizons for numerous intelligent mobile applications. FL imposes huge communication and computation burdens on participating devices due to periodical global synchronization and continuous local training. We develop a convergence-guaranteed FL algorithm enabling flexible communication compression.
arXiv Detail & Related papers (2020-12-22T02:54:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.