Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling
- URL: http://arxiv.org/abs/2508.00904v1
- Date: Tue, 29 Jul 2025 03:08:31 GMT
- Title: Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling
- Authors: Rajeev Patwari, Ashish Sirasao, Devleena Das
- Abstract summary: We introduce LIFE, a lightweight and modular analytical framework composed of configurable analytical models of operators. LIFE characterizes the influence of software and model optimizations, such as quantization, KV cache compression, LoRA adapters, chunked prefill, different attention mechanisms, and operator fusion. We validate LIFE's forecasts with inference on AMD CPUs, NPUs, iGPUs and NVIDIA V100 GPUs, using Llama2-7B variants.
- Score: 0.02091806248191979
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have been increasingly deployed as local agents on personal devices with CPUs, NPUs and integrated GPUs. However, forecasting inference performance on devices with such heterogeneity remains challenging due to their dynamic compute and memory demands. Existing approaches rely on GPU benchmarking or machine learning-based latency predictors, which are often hardware-specific and lack generalizability. To this end, we introduce LIFE, a lightweight and modular analytical framework composed of configurable analytical models of operators, able to characterize LLM inference workloads in a hardware- and dataset-agnostic manner. LIFE characterizes the influence of software and model optimizations, such as quantization, KV cache compression, LoRA adapters, chunked prefill, different attention mechanisms, and operator fusion, on performance metrics such as time-to-first-token (TTFT), time-per-output-token (TPOT) and tokens-per-second (TPS). LIFE enables performance forecasting using only hardware specifications, such as TOPS and memory bandwidth, without requiring extensive dataset benchmarking. We validate LIFE's forecasts with inference on AMD Ryzen CPUs, NPUs, iGPUs and NVIDIA V100 GPUs, with Llama2-7B variants, demonstrating the utility of LIFE in forecasting LLM performance through the lens of system efficiency and enabling efficient LLM deployment across different hardware platforms.
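To make the forecasting idea concrete, below is a minimal Python sketch of a roofline-style estimator that derives TTFT, TPOT and TPS from only peak TOPS and memory bandwidth. The class names, formulas and constants are illustrative assumptions for a dense decoder-only model, not LIFE's actual operator models; quantization and KV cache compression enter only through the per-element byte sizes.

```python
# Minimal, illustrative roofline-style sketch; all formulas and constants
# here are assumptions for illustration, not the paper's actual model.
from dataclasses import dataclass


@dataclass
class Hardware:
    tops: float          # peak compute throughput, tera-ops per second
    mem_bw_gbps: float   # peak memory bandwidth, GB/s


@dataclass
class Model:
    n_params: float          # total parameter count
    bytes_per_weight: float  # e.g. 2.0 for FP16, 0.5 for INT4 (quantization knob)
    n_layers: int
    d_model: int
    bytes_per_kv: float      # bytes per cached K/V element (KV compression knob)


def ttft_s(hw: Hardware, m: Model, prompt_len: int) -> float:
    """Prefill estimate: ~2*n_params ops per prompt token, roofline-limited."""
    compute_s = (2.0 * m.n_params * prompt_len) / (hw.tops * 1e12)
    memory_s = (m.n_params * m.bytes_per_weight) / (hw.mem_bw_gbps * 1e9)
    return max(compute_s, memory_s)


def tpot_s(hw: Hardware, m: Model, context_len: int) -> float:
    """Decode estimate: weights plus the KV cache are re-read for each token."""
    weight_bytes = m.n_params * m.bytes_per_weight
    kv_bytes = 2 * m.n_layers * m.d_model * context_len * m.bytes_per_kv
    compute_s = (2.0 * m.n_params) / (hw.tops * 1e12)
    memory_s = (weight_bytes + kv_bytes) / (hw.mem_bw_gbps * 1e9)
    return max(compute_s, memory_s)


if __name__ == "__main__":
    # Hypothetical device (50 TOPS, 100 GB/s) and hypothetical 7B FP16 model.
    hw = Hardware(tops=50.0, mem_bw_gbps=100.0)
    m = Model(n_params=7e9, bytes_per_weight=2.0,
              n_layers=32, d_model=4096, bytes_per_kv=2.0)
    ttft = ttft_s(hw, m, prompt_len=512)
    tpot = tpot_s(hw, m, context_len=512)
    print(f"TTFT ~ {ttft:.2f} s, TPOT ~ {tpot*1e3:.0f} ms, TPS ~ {1.0/tpot:.1f}")
```

Under these assumptions, prefill is typically compute-bound (cost grows with prompt length) while decode is memory-bound (dominated by re-reading weights and the KV cache), which is why quantization and KV cache compression primarily improve TPOT and TPS in this kind of model.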
Related papers
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z) - MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints [7.287566040274871]
MoE-Lens is an inference system designed through holistic performance modeling for resource-constrained environments. It captures the system execution mechanisms to identify the key hardware bottlenecks and accurately predict the achievable throughput. Evaluated on diverse MoE models and datasets, MoE-Lens outperforms the state-of-the-art solution by 4.6x on average (up to 25.5x).
arXiv Detail & Related papers (2025-04-12T21:26:56Z) - Real-Time Semantic Segmentation of Aerial Images Using an Embedded U-Net: A Comparison of CPU, GPU, and FPGA Workflows [0.0]
This study introduces a lightweight U-Net model optimized for real-time semantic segmentation of aerial images. We maintain the accuracy of the U-Net on a real-world dataset while significantly reducing the model's parameters and Multiply-Accumulate (MAC) operations by a factor of 16.
arXiv Detail & Related papers (2025-03-07T08:33:28Z) - Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format [5.527166214435735]
Quantized large language models (LLMs) leverage low-bit integer (INT) weights and retain floating-point (FP) activations.
This shifts the energy and latency bottlenecks towards the FP activations that are associated with costly memory accesses and computations.
Existing LLM accelerators focus primarily on computation optimizations, overlooking the potential of jointly optimizing FP computations and data movement.
arXiv Detail & Related papers (2024-11-24T20:59:39Z) - WDMoE: Wireless Distributed Mixture of Experts for Large Language Models [68.45482959423323]
Large Language Models (LLMs) have achieved significant success in various natural language processing tasks.
We propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks.
arXiv Detail & Related papers (2024-11-11T02:48:00Z) - LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators [1.1028525384019312]
Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications.
We introduce LLM-Inference-Bench, a comprehensive benchmarking suite to evaluate the hardware inference performance of LLMs.
Our benchmarking results reveal the strengths and limitations of various models, hardware platforms, and inference frameworks.
arXiv Detail & Related papers (2024-10-31T18:34:59Z) - Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective [32.827076621809965]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields. LLMs like the GPT series and Llama series are currently the main focus due to their superior algorithmic performance. Various hardware platforms exhibit distinct hardware characteristics, which can help improve LLM inference performance.
arXiv Detail & Related papers (2024-10-06T12:42:04Z) - Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference [2.2231908139555734]
We propose a general performance modeling methodology and workload analysis of distributed LLM training and inference.
We validate our performance predictions with published data from the literature and relevant industry vendors (e.g., NVIDIA).
arXiv Detail & Related papers (2024-07-19T19:49:05Z) - MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases [81.70591346986582]
We introduce MobileAIBench, a benchmarking framework for evaluating Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices.
MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real devices.
arXiv Detail & Related papers (2024-06-12T22:58:12Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script language engines do not by themselves supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems. In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using such basic cross-platform tensor framework and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)