NeuroScalar: A Deep Learning Framework for Fast, Accurate, and In-the-Wild Cycle-Level Performance Prediction
- URL: http://arxiv.org/abs/2509.22410v2
- Date: Mon, 29 Sep 2025 22:23:10 GMT
- Title: NeuroScalar: A Deep Learning Framework for Fast, Accurate, and In-the-Wild Cycle-Level Performance Prediction
- Authors: Shayne Wadle, Yanxin Zhang, Vikas Singh, Karthikeyan Sankaralingam
- Abstract summary: This paper introduces a novel deep learning framework for high-fidelity, "in-the-wild" simulation on production hardware. Our core contribution is a DL model trained on microarchitecture-independent features to predict cycle-level performance for hypothetical processor designs. We demonstrate that this framework enables accurate performance analysis and hardware A/B testing at massive scale.
- Score: 18.863968099669364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evaluation of new microprocessor designs is constrained by slow, cycle-accurate simulators that rely on unrepresentative benchmark traces. This paper introduces a novel deep learning framework for high-fidelity, "in-the-wild" simulation on production hardware. Our core contribution is a DL model trained on microarchitecture-independent features to predict cycle-level performance for hypothetical processor designs. This unique approach allows the model to be deployed on existing silicon to evaluate future hardware. We propose a complete system featuring a lightweight hardware trace collector and a principled sampling strategy to minimize user impact. This system achieves a simulation speed of 5 MIPS on a commodity GPU, imposing a mere 0.1% performance overhead. Furthermore, our co-designed Neutrino on-chip accelerator improves performance by 85x over the GPU. We demonstrate that this framework enables accurate performance analysis and hardware A/B testing at massive scale using real-world applications.
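To make the core idea concrete, here is a minimal sketch of predicting a cycle count for a hypothetical design from microarchitecture-independent features of an instruction window. The feature names and the tiny linear "model" are illustrative assumptions, not the paper's actual architecture (which is a trained DL model).

```python
# Hypothetical sketch: map microarchitecture-independent features of an
# instruction window to a predicted cycle count under a given design.

def extract_features(window):
    """Microarchitecture-independent features for a window of instructions.

    Each instruction is a dict with an opcode class, the distance (in
    instructions) to its nearest producer, and whether it touches memory.
    """
    n = len(window)
    return {
        "frac_mem": sum(i["is_mem"] for i in window) / n,
        "frac_branch": sum(i["op"] == "branch" for i in window) / n,
        "mean_dep_dist": sum(i["dep_dist"] for i in window) / n,
    }

def predict_cycles(window, design_weights):
    """Predict cycles for the window under a hypothetical design point.

    design_weights stands in for learned parameters; in the paper this
    role is played by a DL model trained per hypothetical design.
    """
    f = extract_features(window)
    base = design_weights["bias"] * len(window)
    penalty = sum(design_weights[k] * f[k] for k in f) * len(window)
    return base + penalty

window = [
    {"op": "alu", "is_mem": False, "dep_dist": 1},
    {"op": "load", "is_mem": True, "dep_dist": 4},
    {"op": "branch", "is_mem": False, "dep_dist": 2},
    {"op": "alu", "is_mem": False, "dep_dist": 1},
]
wide_ooo = {"bias": 0.25, "frac_mem": 1.0, "frac_branch": 0.5, "mean_dep_dist": 0.1}
print(round(predict_cycles(window, wide_ooo), 3))  # prints 3.3
```

Because the features are microarchitecture-independent, the same trace can be collected once on existing silicon and scored under many candidate designs, which is what enables the "in-the-wild" deployment described above.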
Related papers
- Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
- Omniwise: Predicting GPU Kernels Performance with LLMs
We introduce Omniwise, the first end-to-end, self-supervised fine-tuning pipeline that applies large language models (LLMs) to GPU kernel performance prediction. It can predict key performance metrics, including memory bandwidth, cache hit rates, GFLOPs, and arithmetic intensity, directly from kernel code without the need for code execution or profiling tools. Our approach achieves over 90% of predictions within 10% relative error on GPU kernels executed on AMD MI250 and MI300X architectures.
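One of the metrics listed above, arithmetic intensity, has a standard definition: FLOPs performed per byte of memory traffic. The SAXPY accounting below is a textbook example, not taken from the paper:

```python
# Arithmetic intensity = FLOPs / bytes of memory traffic.
# Example kernel: SAXPY (y = a*x + y) over fp32 arrays.

def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved

def saxpy_intensity(n, elem_bytes=4):
    flops = 2 * n                  # one multiply and one add per element
    traffic = 3 * n * elem_bytes   # read x, read y, write y
    return arithmetic_intensity(flops, traffic)

print(saxpy_intensity(1_000_000))  # 2 FLOPs per 12 bytes, about 0.167
```

Kernels with intensity this low are memory-bound on modern GPUs, which is why predicting the metric from code alone is a useful signal for optimization.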
arXiv Detail & Related papers (2025-06-25T23:36:44Z)
- Phantora: Maximizing Code Reuse in Simulation-based Machine Learning System Performance Estimation
Phantora is a hybrid GPU cluster simulator for performance estimation of machine learning training workloads. It allows direct reuse of ML framework source code in simulation, avoiding the need for reimplementation. Phantora supports three state-of-the-art training frameworks out of the box.
arXiv Detail & Related papers (2025-05-02T22:36:24Z)
- Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion
We present Concorde, a new methodology for learning fast and accurate performance models of microarchitectures. Concorde predicts the behavior of a program based on compact performance distributions that capture the impact of different microarchitectural components. Experiments show that Concorde is more than five orders of magnitude faster than a reference cycle-level simulator.
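The general shape of analytical-ML fusion can be sketched as an analytical first-order estimate plus a learned residual correction. The CPI formula below is the classic textbook decomposition; the feature names and correction weights are illustrative assumptions, not Concorde's actual model.

```python
# Schematic analytical-ML fusion: a fast analytical CPI estimate,
# corrected by a learned residual (here a stub linear term).

def analytical_cpi(base_cpi, miss_rate, miss_penalty):
    # First-order model: CPI = base CPI + misses/instruction * penalty
    return base_cpi + miss_rate * miss_penalty

def fused_cpi(features, correction_weights):
    est = analytical_cpi(features["base_cpi"], features["miss_rate"],
                         features["miss_penalty"])
    # Learned residual; a real system would use a trained model here.
    residual = sum(correction_weights[k] * features[k]
                   for k in correction_weights)
    return est + residual

feats = {"base_cpi": 0.5, "miss_rate": 0.02, "miss_penalty": 100.0,
         "branch_mpki": 5.0}
w = {"branch_mpki": 0.03}
print(round(fused_cpi(feats, w), 2))  # 0.5 + 2.0 + 0.15
```

The appeal of the split is that the analytical part generalizes across programs for free, leaving the ML component a much smaller residual to learn.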
arXiv Detail & Related papers (2025-03-29T13:25:20Z)
- Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation
We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level. xPU-Shark captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator. We optimize a common communication collective by up to 15% and reduce token generation latency by up to 4.1%.
arXiv Detail & Related papers (2025-03-18T23:15:02Z)
- A Realistic Simulation Framework for Analog/Digital Neuromorphic Architectures
ARCANA is a software spiking neural network simulator designed to account for the properties of mixed-signal neuromorphic circuits. We show how the results obtained provide a reliable estimate of the behavior of the spiking neural network trained in software, once deployed in hardware.
arXiv Detail & Related papers (2024-09-23T11:16:46Z)
- Enhancing Dropout-based Bayesian Neural Networks with Multi-Exit on FPGA
This paper proposes an algorithm and hardware co-design framework that can generate field-programmable gate array (FPGA)-based accelerators for efficient BayesNNs.
At the algorithm level, we propose novel multi-exit dropout-based BayesNNs with reduced computational and memory overheads.
At the hardware level, this paper introduces a transformation framework that can generate FPGA-based accelerators for the proposed efficient BayesNNs.
arXiv Detail & Related papers (2024-06-20T17:08:42Z)
- GPU-RANC: A CUDA Accelerated Simulation Framework for Neuromorphic Architectures
We introduce a GPU-based implementation of the Reconfigurable Architecture for Neuromorphic Computing (RANC) simulator.
We demonstrate up to a 780x speedup over the serial version of the RANC simulator on a 512-core neuromorphic MNIST inference application.
arXiv Detail & Related papers (2024-04-24T21:08:21Z)
- Tao: Re-Thinking DL-based Microarchitecture Simulation
Existing microarchitecture simulators excel in some respects and fall short in others.
Deep learning (DL)-based simulations are remarkably fast and have acceptable accuracy but fail to provide adequate low-level microarchitectural performance metrics.
This paper introduces TAO, which redesigns DL-based simulation with three primary contributions.
arXiv Detail & Related papers (2024-04-16T21:45:10Z)
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
- MAPLE: Microprocessor A Priori for Latency Estimation
Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption.
Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process.
We propose MAPLE, Microprocessor A Priori for Latency Estimation, which does not rely on transfer learning or domain adaptation.
arXiv Detail & Related papers (2021-11-30T03:52:15Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining
Mini-batch training of graph neural networks (GNNs) requires substantial computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
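The neighborhood sampling that the last entry performance-engineers can be sketched in a few lines: per GNN layer, each frontier node draws at most a fixed fanout of its neighbors. The toy graph, fanout values, and seeded RNG below are illustrative assumptions, not the paper's implementation.

```python
# Minimal fixed-fanout neighborhood sampling for mini-batch GNN training.
import random

def sample_neighborhood(adj, seeds, fanouts, rng):
    """Return, per GNN layer, the sampled edges feeding the seed nodes."""
    layers = []
    frontier = list(seeds)
    for fanout in fanouts:
        edges = []
        next_frontier = set()
        for v in frontier:
            nbrs = adj.get(v, [])
            # Keep all neighbors if there are few; otherwise subsample.
            chosen = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            for u in chosen:
                edges.append((u, v))       # message flows u -> v
                next_frontier.add(u)
        layers.append(edges)
        frontier = sorted(next_frontier)
    return layers

adj = {0: [1, 2, 3], 1: [2], 2: [0, 3], 3: []}
layers = sample_neighborhood(adj, seeds=[0], fanouts=[2, 2],
                             rng=random.Random(7))
print([len(e) for e in layers])
```

Capping the fanout bounds the per-batch work and memory regardless of node degree, which is what makes the sampler a target for the performance engineering the abstract describes.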
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.