Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation
- URL: http://arxiv.org/abs/2503.14781v1
- Date: Tue, 18 Mar 2025 23:15:02 GMT
- Title: Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation
- Authors: Ioannis Zarkadas, Amanda Tomlinson, Asaf Cidon, Baris Kasikci, Ofir Weisse
- Abstract summary: We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level. xPU-Shark captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator. We optimize a common communication collective by up to 15% and reduce token generation latency by up to 4.1%.
- Score: 4.573673188291683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As models become larger, ML accelerators are a scarce resource whose performance must be continually optimized to improve efficiency. Existing performance analysis tools are coarse-grained and fail to capture model performance at the machine-code level. In addition, these tools often do not provide specific recommendations for optimizations. We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level that provides actionable optimization suggestions. Our core insight is to use a hardware-level simulator, an artifact of the hardware design process that we can re-purpose for performance analysis. xPU-Shark captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator to gain low-level insights into the model's performance. We implement xPU-Shark for our in-house accelerator and use it to analyze the performance of several of our production LLMs, revealing several previously-unknown microarchitecture inefficiencies. Leveraging these insights, we optimize a common communication collective by up to 15% and reduce token generation latency by up to 4.1%.
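xPU-Shark itself is not publicly released, so the following is only a minimal Python sketch of the trace-capture-and-replay loop the abstract describes. Every name here (Instruction, capture_trace, MicroarchSimulator, the log format) is a hypothetical stand-in, not xPU-Shark's actual API.

```python
# Hypothetical sketch of a trace-capture-and-replay analysis loop; none of
# these names come from xPU-Shark, whose implementation is not public.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Instruction:
    opcode: str        # machine-level operation, e.g. "matmul" or "all_reduce"
    issue_cycle: int   # cycle at which the accelerator issued it

def capture_trace(log_lines: list[str]) -> list[Instruction]:
    """Parse a made-up production log format: '<cycle> <opcode>' per line."""
    trace = []
    for line in log_lines:
        cycle, opcode = line.split()
        trace.append(Instruction(opcode, int(cycle)))
    return trace

class MicroarchSimulator:
    """Stand-in for a modified hardware-design simulator that replays traces."""
    def replay(self, trace: list[Instruction]) -> Counter:
        # A real simulator models pipelines, stalls, and memory traffic; here
        # we only count per-opcode occurrences as a placeholder for cycle
        # accounting.
        return Counter(instr.opcode for instr in trace)

log = ["0 matmul", "12 all_reduce", "40 matmul", "55 matmul"]  # toy trace
stats = MicroarchSimulator().replay(capture_trace(log))
print(stats.most_common(2))  # surfaces the hottest machine-code operations
```

The point of the replay step is that the simulator, unlike a production profiler, exposes cycle-level state, which is where the abstract's microarchitecture inefficiencies become visible.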
Related papers
- SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization [64.95852289011385]
As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. We propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection.
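The abstract gives no formulas, but the core idea of gradient-descent anchor weighting can be sketched generically: learn weights so that scores on a small anchor subset predict full-benchmark scores. The least-squares objective and all names below are assumptions, not SparseEval's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_anchors = 20, 15
A = rng.random((n_models, n_anchors))  # toy: each model's score on each anchor sample
b = rng.random(n_models)               # toy: each model's full-benchmark score

w = np.zeros(n_anchors)                # anchor weights to learn
L = np.linalg.norm(A, 2) ** 2 / n_models  # Lipschitz constant of the gradient
lr = 1.0 / L                           # step size guaranteeing convergence
for _ in range(3000):                  # plain gradient descent on mean squared error
    grad = A.T @ (A @ w - b) / n_models
    w -= lr * grad

print("mean abs prediction error:", np.abs(A @ w - b).mean())
```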
arXiv Detail & Related papers (2026-02-08T11:12:45Z) - Optimizing Agentic Language Model Inference via Speculative Tool Calls [4.106903307413157]
We introduce novel systems optimizations to address performance bottlenecks during the inference process. Our optimizations lead to throughput improvements of several hundred tokens per second when hosting inference for LM agents. We recommend a new "tool cache" API endpoint to enable LM providers to easily adopt these optimizations.
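The recommended "tool cache" endpoint is not specified in this summary; as a rough illustration of the idea, deterministic tool results can be memoized by call signature so that speculative calls are served without re-execution. All names below are hypothetical.

```python
import json
from typing import Any, Callable

class ToolCache:
    """Hypothetical cache for deterministic tool results, keyed by name + args."""
    def __init__(self) -> None:
        self._store: dict[str, Any] = {}

    def call(self, name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        key = name + ":" + json.dumps(kwargs, sort_keys=True)
        if key not in self._store:      # miss: execute the tool and memoize
            self._store[key] = fn(**kwargs)
        return self._store[key]         # hit: reuse the speculative result

cache = ToolCache()
weather = lambda city: f"sunny in {city}"            # stand-in tool
print(cache.call("weather", weather, city="Paris"))  # executes the tool
print(cache.call("weather", weather, city="Paris"))  # served from the cache
```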
arXiv Detail & Related papers (2025-12-17T18:22:44Z) - DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs [18.46752801066992]
We introduce DABench-LLM, a benchmarking framework for evaluating large language models on dataflow-based accelerators. We validate DABench-LLM on three commodity dataflow accelerators: Cerebras WSE-2, SambaNova RDU, and Graphcore IPU.
arXiv Detail & Related papers (2025-12-04T22:43:14Z) - PerfDojo: Automated ML Library Generation for Heterogeneous Architectures [28.513777562827485]
We introduce PerfLLM, a novel automatic optimization methodology leveraging Large Language Models (LLMs) and Reinforcement Learning (RL). Central to this is PerfDojo, an environment framing optimization as an RL game using a human-readable, mathematically-inspired code representation that guarantees semantic validity through transformations. We demonstrate PerfLLM's ability to achieve significant performance gains across diverse CPU (x86, Arm, RISC-V) and GPU architectures.
arXiv Detail & Related papers (2025-11-05T16:05:26Z) - MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We build a benchmark consisting of 1,260 samples across 42 challenging synthetic tasks. We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z) - NeuroScalar: A Deep Learning Framework for Fast, Accurate, and In-the-Wild Cycle-Level Performance Prediction [18.863968099669364]
This paper introduces a novel deep learning framework for high-fidelity, "in-the-wild" simulation on production hardware. Our core contribution is a DL model trained on microarchitecture-independent features to predict cycle-level performance for hypothetical processor designs. We demonstrate that this framework enables accurate performance analysis and hardware A/B testing at massive scale.
arXiv Detail & Related papers (2025-09-26T14:36:06Z) - Phantora: Live GPU Cluster Simulation for Machine Learning System Performance Estimation [11.48166268734119]
Phantora is a live GPU cluster simulator for performance estimation. It overcomes several research challenges in integrating an event-driven network simulator with live system execution. Our evaluation results show that Phantora can deliver estimation accuracy similar to the state-of-the-art workload simulation approach while requiring only one GPU.
arXiv Detail & Related papers (2025-05-02T22:36:24Z) - Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training [4.059735204483926]
We propose Lumos, a trace-driven performance modeling and estimation toolkit for large-scale LLM training.
We show that Lumos can replay execution time with an average error of just 3.3%, along with other runtime details, across different models and configurations.
arXiv Detail & Related papers (2025-04-12T18:43:24Z) - Pruning-Based TinyML Optimization of Machine Learning Models for Anomaly Detection in Electric Vehicle Charging Infrastructure [8.29566258132752]
This paper investigates a pruning method for anomaly detection in resource-constrained environments, specifically targeting EVCI. The optimized models achieved significant reductions in model size and inference times, with only a marginal impact on their performance. Notably, our findings indicate that, in the context of EVCI, pruning and feature selection (FS) can enhance computational efficiency while retaining critical anomaly detection capabilities.
arXiv Detail & Related papers (2025-03-19T00:18:37Z) - IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Agents [17.301758094000125]
Large language model (LLM) agents have emerged as a promising solution to automate the development of computer vision models. We introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design. Iterative Refinement improves stability, interpretability, and overall model performance.
arXiv Detail & Related papers (2025-02-25T01:52:37Z) - LoXR: Performance Evaluation of Locally Executing LLMs on XR Devices [55.33807002543901]
We deploy 17 large language models (LLMs) across four XR devices.
We evaluate performance on four key metrics: performance consistency, processing speed, memory usage, and battery consumption.
arXiv Detail & Related papers (2025-02-13T20:55:48Z) - Numerical Pruning for Efficient Autoregressive Models [87.56342118369123]
This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning. Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and MLP modules, respectively. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments.
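The paper's Newton-based score is not reproduced here; as a generic point of reference, classical second-order pruning (in the style of Optimal Brain Damage) ranks weights by the estimated loss increase from removing them. The sketch below is per-weight for brevity, whereas the paper prunes whole structures.

```python
import torch

def second_order_saliency(w: torch.Tensor, h_diag: torch.Tensor) -> torch.Tensor:
    """OBD-style saliency: removing w_i raises the loss by about 0.5 * H_ii * w_i^2.
    A generic stand-in, not the paper's Newton-based numerical score."""
    return 0.5 * h_diag * w.pow(2)

w = torch.randn(8)
h_diag = torch.rand(8) + 0.1                # toy positive Hessian diagonal
keep = second_order_saliency(w, h_diag).topk(4).indices
mask = torch.zeros_like(w)
mask[keep] = 1.0
print(w * mask)                             # half the weights pruned to zero
```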
arXiv Detail & Related papers (2024-12-17T01:09:23Z) - The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines [6.381783966294295]
Open-source large language models (LLMs) enable developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance.
We analyze the performance, particularly the throughput (tokens generated per unit of time), of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines.
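Throughput here is tokens generated per unit of time; measuring it with vLLM looks roughly like the sketch below. The model name, prompts, and sampling settings are placeholders, not the paper's configuration.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                 # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=128)
prompts = ["Explain speculative decoding."] * 32     # toy batch

start = time.perf_counter()
outputs = llm.generate(prompts, params)              # batched generation
elapsed = time.perf_counter() - start

# Count generated tokens across all requests to get tokens per second.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated / elapsed:.1f} tokens/s")
```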
arXiv Detail & Related papers (2024-08-02T06:56:59Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate the memory and compute demands of large language models by compressing and accelerating them.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
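LLMC's algorithms and data formats are not detailed in this summary; for orientation, the basic round-trip that any weight-quantization scheme performs (here generic uniform symmetric int8, not LLMC's code) looks like this:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Uniform symmetric int8 quantization; a generic illustration, not LLMC's method."""
    scale = np.abs(x).max() / 127.0                  # map the largest weight to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale              # recover approximate weights

w = np.random.randn(4, 4).astype(np.float32)         # toy weight matrix
q, s = quantize_int8(w)
print("max abs round-trip error:", np.abs(dequantize(q, s) - w).max())
```

Calibration data, algorithm strategy, and data format (the three axes the benchmark covers) all change how that scale and rounding are chosen.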
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - Learning Generalizable Program and Architecture Representations for Performance Modeling [0.3277163122167434]
PerfVec is a novel deep learning-based performance modeling framework.
It learns high-dimensional and independent/orthogonal program and microarchitecture representations.
PerfVec yields a foundation model that captures the performance essence of instructions.
arXiv Detail & Related papers (2023-10-25T17:24:01Z) - Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
We propose adapter-ALBERT, an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of over 77,000 pairs of competitive C++ programming submissions, capturing performance-improving edits made by human programmers.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought; for finetuning, we use performance-conditioned generation and synthetic data augmentation based on self-play.
arXiv Detail & Related papers (2023-02-15T18:59:21Z) - Data-Driven Offline Optimization For Architecting Hardware Accelerators [89.68870139177785]
We develop a data-driven offline optimization method for designing hardware accelerators, dubbed PRIME.
PRIME improves performance over state-of-the-art simulation-driven methods by about 1.54x and 1.20x, while considerably reducing the required total simulation time by 93% and 99%, respectively.
In addition, PRIME also architects effective accelerators for unseen applications in a zero-shot setting, outperforming simulation-based methods by 1.26x.
arXiv Detail & Related papers (2021-10-20T17:06:09Z)