Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation
- URL: http://arxiv.org/abs/2503.14781v1
- Date: Tue, 18 Mar 2025 23:15:02 GMT
- Title: Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation
- Authors: Ioannis Zarkadas, Amanda Tomlinson, Asaf Cidon, Baris Kasikci, Ofir Weisse
- Abstract summary: We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level. xPU-Shark captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator. We optimize a common communication collective by up to 15% and reduce token generation latency by up to 4.1%.
- Score: 4.573673188291683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As models become larger, ML accelerators are a scarce resource whose performance must be continually optimized to improve efficiency. Existing performance analysis tools are coarse-grained, and fail to capture model performance at the machine-code level. In addition, these tools often do not provide specific recommendations for optimizations. We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level that provides actionable optimization suggestions. Our core insight is to use a hardware-level simulator, an artifact of the hardware design process that we can re-purpose for performance analysis. xPU-Shark captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator to gain low-level insights into the model's performance. We implement xPU-Shark for our in-house accelerator and use it to analyze the performance of several of our production LLMs, revealing several previously-unknown microarchitecture inefficiencies. Leveraging these insights, we optimize a common communication collective by up to 15% and reduce token generation latency by up to 4.1%.
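As a rough illustration of this capture-and-replay workflow, the sketch below uses hypothetical stand-ins (`Instruction`, `capture_trace`, `MicroArchSim`); the paper's actual tracing hooks and simulator target a proprietary in-house accelerator and are not described at API level, so every name here is an assumption.

```python
# Minimal sketch of a capture-and-replay analysis loop in the spirit of
# xPU-Shark. All names are hypothetical stand-ins for a proprietary toolchain.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Instruction:
    opcode: str       # machine-level operation, e.g. "matmul" or "all_reduce"
    operands: tuple   # register / memory operands
    issue_cycle: int  # cycle at which the instruction was issued in production

@dataclass
class Trace:
    instructions: List[Instruction]

def capture_trace(deployment_id: str) -> Trace:
    """Placeholder for pulling a machine-code trace from a production run."""
    return Trace([
        Instruction("matmul", ("v0", "v1", "v2"), 0),
        Instruction("all_reduce", ("v2",), 120),
        Instruction("matmul", ("v2", "v3", "v4"), 400),
    ])

class MicroArchSim:
    """Stand-in for a hardware-design simulator repurposed for analysis."""
    def replay(self, trace: Trace) -> Dict[str, int]:
        # A real simulator models pipelines, memory, and interconnect to find
        # stalls; here we only aggregate instruction counts per opcode.
        counts: Dict[str, int] = {}
        for ins in trace.instructions:
            counts[ins.opcode] = counts.get(ins.opcode, 0) + 1
        return counts

if __name__ == "__main__":
    report = MicroArchSim().replay(capture_trace("prod-llm-serving"))
    # The hot opcodes are where low-level optimization suggestions
    # (e.g. for communication collectives) would be generated.
    for opcode, count in sorted(report.items(), key=lambda kv: -kv[1]):
        print(f"{opcode}: {count}")
```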
Related papers
- Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training [4.059735204483926]
We propose Lumos, a trace-driven performance modeling and estimation toolkit for large-scale LLM training.
We show that Lumos can replay execution time with an average error of just 3.3%, along with other runtime details, across different models and configurations.
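At its simplest, trace-driven replay estimates step time by advancing per-stream timelines from recorded events. The event schema below (stream, start, duration) is an assumption for illustration, not Lumos's actual trace format.

```python
# Minimal sketch of trace-driven replay: estimate end-to-end step time from
# per-operator events recorded in a trace.
from collections import defaultdict

def replay_step_time(events):
    """events: iterable of dicts with 'stream', 'start_us', 'dur_us'."""
    end_per_stream = defaultdict(float)
    for ev in events:
        end = ev["start_us"] + ev["dur_us"]
        end_per_stream[ev["stream"]] = max(end_per_stream[ev["stream"]], end)
    # The step finishes when the slowest stream (compute or communication) does.
    return max(end_per_stream.values())

trace = [
    {"stream": "compute", "start_us": 0.0,   "dur_us": 120.0},
    {"stream": "comm",    "start_us": 40.0,  "dur_us": 100.0},
    {"stream": "compute", "start_us": 120.0, "dur_us": 80.0},
]
print(replay_step_time(trace))  # 200.0 microseconds
```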
arXiv Detail & Related papers (2025-04-12T18:43:24Z)
- Pruning-Based TinyML Optimization of Machine Learning Models for Anomaly Detection in Electric Vehicle Charging Infrastructure [8.29566258132752]
This paper investigates a pruning method for anomaly detection in resource-constrained environments, specifically targeting EVCI. The optimized models achieved significant reductions in model size and inference times, with only a marginal impact on their performance. Notably, our findings indicate that, in the context of EVCI, pruning and FS can enhance computational efficiency while retaining critical anomaly detection capabilities.
arXiv Detail & Related papers (2025-03-19T00:18:37Z)
- IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Agents [17.301758094000125]
Large language model (LLM) agents have emerged as a promising solution to automate the development of computer vision models. We introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design. Iterative Refinement improves stability, interpretability, and overall model performance.
arXiv Detail & Related papers (2025-02-25T01:52:37Z)
- LoXR: Performance Evaluation of Locally Executing LLMs on XR Devices [55.33807002543901]
We deploy 17 large language models (LLMs) across four XR devices.
We evaluate performance on four key metrics: performance consistency, processing speed, memory usage, and battery consumption.
arXiv Detail & Related papers (2025-02-13T20:55:48Z)
- Numerical Pruning for Efficient Autoregressive Models [87.56342118369123]
This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning. Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and MLP modules, respectively. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments.
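The Newton-based scoring builds on second-order saliency ideas; the sketch below shows that general idea (a Taylor-expansion score per weight group, using a crude diagonal curvature proxy), not the paper's exact formula.

```python
# Generic second-order (Newton-style) saliency sketch for structured pruning.
# Illustrative only: the real method's score and grouping differ.
import torch

def saliency(weight, grad, hess_diag, group_dim=0):
    """Estimated loss increase from zeroing each group (row) of `weight`.

    Uses the Taylor expansion  dL ~= -g*w + 0.5*h*w^2  summed per group,
    where `hess_diag` is a diagonal Hessian approximation.
    """
    per_weight = -grad * weight + 0.5 * hess_diag * weight.pow(2)
    other_dims = [d for d in range(weight.dim()) if d != group_dim]
    return per_weight.sum(dim=other_dims)

# Toy usage: score 8 "heads" of an 8x64 projection and keep the top 6.
w = torch.randn(8, 64, requires_grad=True)
loss = (w.sum(dim=1) ** 2).mean()
loss.backward()
curvature = w.grad.pow(2)              # Fisher-style curvature proxy
scores = saliency(w.detach(), w.grad, curvature)
keep = scores.topk(6).indices
print(sorted(keep.tolist()))
```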
arXiv Detail & Related papers (2024-12-17T01:09:23Z)
- The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines [6.381783966294295]
Open-source large language models (LLMs) enable developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance.
We analyze the performance, particularly the throughput (tokens generated per unit of time) of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines.
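A minimal throughput harness in this spirit can be sketched with the HuggingFace text-generation pipeline; the model name and prompts below are placeholders, and the paper's evaluation additionally covers vLLM and 20 models.

```python
# Sketch of a tokens-per-second measurement with the HuggingFace pipeline API.
import time
from transformers import AutoTokenizer, pipeline

model_name = "gpt2"  # placeholder; swap in any open LLM
tok = AutoTokenizer.from_pretrained(model_name)
gen = pipeline("text-generation", model=model_name, tokenizer=tok)

prompts = ["Explain pipelining in one sentence."] * 8
start = time.perf_counter()
outputs = gen(prompts, max_new_tokens=64, do_sample=False)
elapsed = time.perf_counter() - start

# Count only generated tokens: output tokens minus prompt tokens.
new_tokens = 0
for prompt, out in zip(prompts, outputs):
    text = out[0]["generated_text"]
    new_tokens += len(tok(text)["input_ids"]) - len(tok(prompt)["input_ids"])

print(f"throughput: {new_tokens / elapsed:.1f} tokens/s")
```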
arXiv Detail & Related papers (2024-08-02T06:56:59Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
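For context on what such a benchmark measures, the sketch below shows only the simplest ingredient: per-tensor symmetric int8 round-to-nearest quantization with reconstruction error as a crude quality proxy. LLMC itself spans many more algorithms, data formats, and calibration strategies.

```python
# Simplest building block of weight quantization: per-tensor symmetric int8.
import torch

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
rms_err = (dequantize(q, s) - w).pow(2).mean().sqrt()
print(f"int8 RMS reconstruction error: {rms_err:.4f}")
```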
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- Learning Generalizable Program and Architecture Representations for Performance Modeling [0.3277163122167434]
PerfVec is a novel deep learning-based performance modeling framework.
It learns high-dimensional and independent/orthogonal program and microarchitecture representations.
PerfVec yields a foundation model that captures the performance essence of instructions.
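The separation of program and microarchitecture representations can be sketched as two independent encoders feeding a small predictor; the MLP encoders and layer sizes below are assumptions for illustration, not PerfVec's actual architecture.

```python
# Sketch of "independent program and microarchitecture representations":
# two encoders whose embeddings are combined by a shared prediction head.
import torch
import torch.nn as nn

class PerfModel(nn.Module):
    def __init__(self, prog_feats: int, uarch_feats: int, dim: int = 64):
        super().__init__()
        self.prog_enc = nn.Sequential(nn.Linear(prog_feats, dim), nn.ReLU())
        self.uarch_enc = nn.Sequential(nn.Linear(uarch_feats, dim), nn.ReLU())
        self.head = nn.Linear(2 * dim, 1)   # predicted latency / cycles

    def forward(self, prog_x, uarch_x):
        # Because the two embeddings are learned independently, a program
        # embedding can be paired with an unseen microarchitecture embedding.
        z = torch.cat([self.prog_enc(prog_x), self.uarch_enc(uarch_x)], dim=-1)
        return self.head(z)

model = PerfModel(prog_feats=32, uarch_feats=16)
pred = model(torch.randn(8, 32), torch.randn(8, 16))
print(pred.shape)  # torch.Size([8, 1])
```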
arXiv Detail & Related papers (2023-10-25T17:24:01Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of over 77,000 pairs of competitive C++ programming submissions, capturing performance-improving edits made by human programmers.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought; for finetuning, we use performance-conditioned generation and synthetic data augmentation based on self-play.
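Retrieval-based few-shot prompting for code optimization can be sketched as retrieving the most similar (slow, fast) example pairs and prepending them to the prompt; the similarity measure and prompt wording below are placeholders, not the paper's.

```python
# Sketch of retrieval-based few-shot prompting for performance-improving edits.
from difflib import SequenceMatcher

def retrieve(query_code, bank, k=2):
    """bank: list of (slow_code, fast_code) pairs from a training set."""
    scored = sorted(
        bank,
        key=lambda pair: SequenceMatcher(None, query_code, pair[0]).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query_code, bank):
    parts = ["Rewrite the final program to run faster without changing its output.\n"]
    for slow, fast in retrieve(query_code, bank):
        parts.append(f"# slower version\n{slow}\n# faster version\n{fast}\n")
    parts.append(f"# slower version\n{query_code}\n# faster version\n")
    return "\n".join(parts)

bank = [
    ("total = 0\nfor x in xs:\n    total += x", "total = sum(xs)"),
    ("out = []\nfor x in xs:\n    out.append(x * x)", "out = [x * x for x in xs]"),
]
print(build_prompt("s = 0\nfor v in vals:\n    s += v", bank))
```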
arXiv Detail & Related papers (2023-02-15T18:59:21Z)
- Data-Driven Offline Optimization For Architecting Hardware Accelerators [89.68870139177785]
We develop a data-driven offline optimization method for designing hardware accelerators, dubbed PRIME.
PRIME improves performance over state-of-the-art simulation-driven methods by about 1.54x and 1.20x, while considerably reducing the required total simulation time by 93% and 99%, respectively.
In addition, PRIME also architects effective accelerators for unseen applications in a zero-shot setting, outperforming simulation-based methods by 1.26x.
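The data-driven offline setup can be sketched as fitting a surrogate on logged (configuration, performance) pairs and ranking unseen candidates with it, instead of simulating each one. PRIME additionally trains its surrogate conservatively to avoid overestimating unseen designs; the naive sketch below omits that, and its features and data are made up for illustration.

```python
# Naive sketch of offline, data-driven accelerator design space exploration.
import numpy as np

rng = np.random.default_rng(0)
# Logged dataset: columns = [num_pes, sram_kb, bus_width], target = throughput.
X_logged = rng.uniform(0, 1, size=(200, 3))
y_logged = (2.0 * X_logged[:, 0] + 1.0 * X_logged[:, 1] - 0.5 * X_logged[:, 2]
            + rng.normal(0, 0.05, size=200))

# Linear surrogate via least squares (a stand-in for a learned model).
A = np.hstack([X_logged, np.ones((200, 1))])
theta, *_ = np.linalg.lstsq(A, y_logged, rcond=None)

# Score unseen candidate configurations offline, without running a simulator.
candidates = rng.uniform(0, 1, size=(1000, 3))
scores = np.hstack([candidates, np.ones((1000, 1))]) @ theta
best = candidates[np.argmax(scores)]
print("best predicted config:", np.round(best, 3))
```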
arXiv Detail & Related papers (2021-10-20T17:06:09Z)