LLMPerf: GPU Performance Modeling meets Large Language Models
- URL: http://arxiv.org/abs/2503.11244v1
- Date: Fri, 14 Mar 2025 09:52:30 GMT
- Title: LLMPerf: GPU Performance Modeling meets Large Language Models
- Authors: Khoi N. M. Nguyen, Hoang Duy Nguyen Do, Huyen Thao Le, Thanh Tuan Dao
- Abstract summary: Large Language Models (LLMs) have demonstrated their effectiveness in addressing diverse programming challenges. Our work establishes a connection between LLMs and performance modeling, employing the LLM as a performance estimator. Our model achieves a mean absolute percentage error of $46.1\%$ on a set of publicly available OpenCL programs.
- Score: 0.22499166814992436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Performance modeling, a pivotal domain in program cost analysis, currently relies on manually crafted models constrained by various program and hardware limitations, especially in the intricate landscape of GPGPU. Meanwhile, Large Language Models (LLMs) have demonstrated their effectiveness in addressing diverse programming challenges. Our work establishes a connection between LLMs and performance modeling, employing the LLM as a performance estimator. Through experimental exploration with carefully designed large-scale OpenCL datasets, we highlight the potential capability as well as the main difficulties of using LLMs in handling performance modeling tasks for OpenCL device source programs. As the first study for this line of work, our LLM-based performance model achieves a mean absolute percentage error of $24.25\%$ for a large-scale generated validation set. On a set of publicly available OpenCL programs, our model achieves a mean absolute percentage error of $46.1\%$.
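For reference, the mean absolute percentage error (MAPE) quoted above can be computed from measured and predicted kernel runtimes as in the short sketch below; the numbers and variable names are illustrative, not taken from the paper's datasets.

```python
import numpy as np

def mape(measured, predicted):
    """Mean absolute percentage error between measured and predicted runtimes."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((predicted - measured) / measured)) * 100.0)

# Illustrative values only: measured vs. LLM-estimated kernel times (ms).
measured_ms = [1.20, 0.45, 3.10]
predicted_ms = [1.00, 0.60, 2.70]
print(f"MAPE: {mape(measured_ms, predicted_ms):.2f}%")
```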
Related papers
- MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints [7.287566040274871]
MoE-Lens is an inference system designed through holistic performance modeling for resource-constrained environments.
It captures the system execution mechanisms to identify the key hardware bottlenecks and accurately predict the achievable throughput.
Evaluated on diverse MoE models and datasets, MoE-Lens outperforms the state-of-the-art solution by 4.6x on average (up to 25.5x).
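MoE-Lens's actual performance model is not reproduced here, but a roofline-style sketch illustrates how a single hardware bottleneck (compute or memory bandwidth) caps achievable decode throughput; every constant below is an assumption for illustration.

```python
def throughput_bound_tokens_per_s(flops_per_token, bytes_per_token,
                                  peak_flops, peak_bw_bytes):
    """Upper-bound tokens/s as the tighter of the compute and memory roofs."""
    compute_bound = peak_flops / flops_per_token
    memory_bound = peak_bw_bytes / bytes_per_token
    return min(compute_bound, memory_bound)

# Illustrative numbers for a hypothetical MoE decode step on one accelerator.
bound = throughput_bound_tokens_per_s(
    flops_per_token=2 * 10e9,   # ~2 FLOPs per active parameter, 10B active params (assumed)
    bytes_per_token=20e9,       # weight + KV-cache traffic per token (assumed)
    peak_flops=300e12,          # 300 TFLOP/s peak compute (assumed)
    peak_bw_bytes=2e12,         # 2 TB/s memory bandwidth (assumed)
)
print(f"Achievable throughput bound: {bound:.0f} tokens/s")
```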
arXiv Detail & Related papers (2025-04-12T21:26:56Z) - Cost-Optimal Grouped-Query Attention for Long-Context LLMs [64.90662568387683]
Building effective Transformer-based large language models (LLMs) has recently become a research focus.
We compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost.
Our studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs.
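The memory side of that trade-off is visible in the KV-cache size, which scales with the number of KV heads and the context length; the sketch below uses assumed model dimensions, not the paper's configurations.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """KV-cache size: keys + values for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative configs: full multi-head attention vs. grouped-query attention.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=128_000, batch=1)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8,  head_dim=128, seq_len=128_000, batch=1)
print(f"MHA KV cache: {mha / 2**30:.1f} GiB, GQA KV cache: {gqa / 2**30:.1f} GiB")
```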
arXiv Detail & Related papers (2025-03-12T17:50:42Z) - LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators [1.1028525384019312]
Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications.
We introduce LLM-Inference-Bench, a comprehensive benchmarking suite to evaluate the hardware inference performance of LLMs.
Our benchmarking results reveal the strengths and limitations of various models, hardware platforms, and inference frameworks.
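The benchmarking suite itself is not shown here; a minimal sketch of the core measurement (generated tokens per second) using Hugging Face Transformers might look like the following, with the model name and prompt chosen only for illustration.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model name; any locally available causal LM would do.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "__kernel void add(__global float* a) {"
inputs = tok(prompt, return_tensors="pt")

start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```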
arXiv Detail & Related papers (2024-10-31T18:34:59Z) - Performance Law of Large Language Models [58.32539851241063]
Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.
arXiv Detail & Related papers (2024-08-19T11:09:12Z) - Understanding the Performance and Estimating the Cost of LLM Fine-Tuning [9.751868268608675]
Fine-tuning adapts Large Language Models (LLMs) to specific tasks in a cost-effective manner.
In this paper, we characterize sparse Mixture of Experts (MoE) based LLM fine-tuning to understand their accuracy and runtime performance.
We also develop and validate an analytical model to estimate the cost of LLM fine-tuning on the cloud.
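The paper's analytical model is not reproduced here; a back-of-the-envelope sketch in the same spirit estimates cloud fine-tuning cost from the common ~6 x parameters x tokens training-FLOPs approximation, with every constant below an assumption.

```python
def finetune_cost_usd(active_params, tokens, peak_flops, mfu, gpus, usd_per_gpu_hour):
    """Rough cloud cost: training FLOPs / sustained cluster FLOP/s -> hours -> dollars."""
    total_flops = 6 * active_params * tokens    # common training-FLOPs approximation
    sustained = peak_flops * mfu * gpus         # achieved FLOP/s across the cluster
    hours = total_flops / sustained / 3600
    return hours * gpus * usd_per_gpu_hour, hours

cost, hours = finetune_cost_usd(
    active_params=7e9,      # 7B active parameters (assumed; MoE counts only active params)
    tokens=1e9,             # 1B fine-tuning tokens (assumed)
    peak_flops=312e12,      # per-GPU peak FLOP/s (assumed A100-class)
    mfu=0.4,                # assumed model FLOPs utilization
    gpus=8,
    usd_per_gpu_hour=2.0,   # assumed cloud price
)
print(f"~{hours:.1f} h wall-clock, ~${cost:.0f}")
```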
arXiv Detail & Related papers (2024-08-08T16:26:07Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
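The full SELF-GUIDE pipeline has more stages and filtering than shown; the sketch below conveys only the core idea of synthesizing (input, output) pairs from the student model itself, with `generate` standing in as a hypothetical callable for any student-LLM call.

```python
from typing import Callable, List, Tuple

def self_guide_synthesize(
    generate: Callable[[str], str],       # placeholder for a call into the student LLM
    task_instruction: str,
    seed_examples: List[Tuple[str, str]],
    n_new: int = 3,
) -> List[Tuple[str, str]]:
    """Synthesize task-specific (input, output) pairs from the student model itself."""
    demos = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in seed_examples)
    pairs = []
    for _ in range(n_new):
        new_input = generate(f"{task_instruction}\n{demos}\nInput:")
        new_output = generate(f"{task_instruction}\nInput: {new_input}\nOutput:")
        pairs.append((new_input.strip(), new_output.strip()))
    return pairs  # these pairs would then be used to fine-tune the same student model

# Usage with a stub generator (replace with a real model call):
print(self_guide_synthesize(lambda p: "example text", "Classify sentiment.", [("great!", "positive")]))
```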
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - Knowledge Fusion of Large Language Models [73.28202188100646]
This paper introduces the notion of knowledge fusion for large language models (LLMs).
We externalize their collective knowledge and unique strengths, thereby elevating the capabilities of the target model beyond those of any individual source LLM.
Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.
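The paper's exact fusion strategy is not spelled out in this summary; one generic way to fuse source models, shown below purely as an illustration, averages their per-token probability distributions and trains the target model toward the fused distribution (real source models would also need vocabulary alignment).

```python
import torch
import torch.nn.functional as F

def fuse_distributions(source_logits, weights):
    """Weighted average of per-token distributions from several source LLMs (generic rule)."""
    probs = [F.softmax(l, dim=-1) for l in source_logits]
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused / sum(weights)

def fusion_loss(target_logits, fused_probs):
    """Train the target model toward the fused distribution (KL to soft targets)."""
    log_q = F.log_softmax(target_logits, dim=-1)
    return F.kl_div(log_q, fused_probs, reduction="batchmean")

# Illustrative shapes: batch=2, seq=4, vocab=10.
src_a, src_b = torch.randn(2, 4, 10), torch.randn(2, 4, 10)
tgt = torch.randn(2, 4, 10, requires_grad=True)
loss = fusion_loss(tgt, fuse_distributions([src_a, src_b], [0.5, 0.5]))
loss.backward()
print(float(loss))
```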
arXiv Detail & Related papers (2024-01-19T05:02:46Z) - LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
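The paper's exact importance criterion is not reproduced here; a generic first-order (gradient x weight) score over a coupled structure, sketched below on a toy MLP, illustrates how non-critical groups can be ranked for removal.

```python
import torch
import torch.nn as nn

def taylor_importance(weight, grad):
    """First-order Taylor importance of a slice: sum of |w * dL/dw|."""
    return (weight * grad).abs().sum().item()

# Toy example: hidden unit j of a small MLP couples fc1's row j with fc2's column j.
torch.manual_seed(0)
fc1, fc2 = nn.Linear(8, 16), nn.Linear(16, 8)
x = torch.randn(4, 8)
loss = fc2(torch.relu(fc1(x))).pow(2).mean()   # stand-in for a calibration loss
loss.backward()

with torch.no_grad():
    scores = [
        taylor_importance(fc1.weight[j], fc1.weight.grad[j])
        + taylor_importance(fc2.weight[:, j], fc2.weight.grad[:, j])
        for j in range(16)
    ]
prune = sorted(range(16), key=lambda j: scores[j])[:4]   # lowest-importance units
print("candidate hidden units to prune:", prune)
```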
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.