Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion
- URL: http://arxiv.org/abs/2503.23076v1
- Date: Sat, 29 Mar 2025 13:25:20 GMT
- Title: Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion
- Authors: Arash Nasr-Esfahany, Mohammad Alizadeh, Victor Lee, Hanna Alam, Brett W. Coon, David Culler, Vidushi Dadu, Martin Dixon, Henry M. Levy, Santosh Pandey, Parthasarathy Ranganathan, Amir Yazdanbakhsh
- Abstract summary: We present Concorde, a new methodology for learning fast and accurate performance models of microarchitectures. Concorde predicts the behavior of a program based on compact performance distributions that capture the impact of different microarchitectural components. Experiments show that Concorde is more than five orders of magnitude faster than a reference cycle-level simulator.
- Score: 15.06323814625609
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cycle-level simulators such as gem5 are widely used in microarchitecture design, but they are prohibitively slow for large-scale design space explorations. We present Concorde, a new methodology for learning fast and accurate performance models of microarchitectures. Unlike existing simulators and learning approaches that emulate each instruction, Concorde predicts the behavior of a program based on compact performance distributions that capture the impact of different microarchitectural components. It derives these performance distributions using simple analytical models that estimate bounds on performance induced by each microarchitectural component, providing a simple yet rich representation of a program's performance characteristics across a large space of microarchitectural parameters. Experiments show that Concorde is more than five orders of magnitude faster than a reference cycle-level simulator, with about 2% average Cycles-Per-Instruction (CPI) prediction error across a range of SPEC, open-source, and proprietary benchmarks. This enables rapid design-space exploration and performance sensitivity analyses that are currently infeasible; for example, in about an hour, we conducted a first-of-its-kind fine-grained performance attribution to different microarchitectural components across a diverse set of programs, requiring nearly 150 million CPI evaluations.
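As the abstract describes, the pipeline has two stages: simple analytical models turn a program and a candidate microarchitecture into compact per-component performance-bound distributions, and a learned model fuses those distributions into a CPI prediction. The Python sketch below illustrates that flow only; the histogram representation, the component count, the parameter encoding, and the tiny MLP are all assumptions for exposition, not Concorde's actual interfaces.

```python
# Minimal sketch of compositional analytical-ML fusion. Every name and shape
# here is illustrative; Concorde's real feature design and model differ.
import numpy as np

rng = np.random.default_rng(0)

def analytical_bounds(n_components=8, n_bins=16):
    # Stand-in for the analytical models: for each microarchitectural
    # component (ROB, fetch width, caches, ...), emit a compact histogram
    # of the performance bound that component induces on the program.
    return rng.dirichlet(np.ones(n_bins), size=n_components)

def fuse_and_predict(bound_dists, uarch_params):
    # Concatenate the per-component distributions with the raw
    # microarchitectural parameters and run a small MLP. Weights are random
    # stand-ins here; in the paper, the fusion model is trained against a
    # cycle-level simulator.
    x = np.concatenate([bound_dists.ravel(), uarch_params])
    W1, b1 = rng.normal(size=(x.size, 32)), np.zeros(32)
    W2, b2 = rng.normal(size=32), 0.0
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer
    return float(h @ W2 + b2)         # predicted CPI

bounds = analytical_bounds()
uarch = np.array([256.0, 8.0, 32.0])  # e.g., ROB entries, fetch width, L1 KB
print("predicted CPI:", fuse_and_predict(bounds, uarch))
```

Because the analytical stage involves no simulation and inference is a single forward pass, a pipeline of this shape is consistent with the five-orders-of-magnitude speedup over cycle-level simulation that the abstract reports.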
Related papers
- NeuroScalar: A Deep Learning Framework for Fast, Accurate, and In-the-Wild Cycle-Level Performance Prediction [18.863968099669364]
This paper introduces a novel deep learning framework for high-fidelity, "in-the-wild" simulation on production hardware. Our core contribution is a DL model trained on microarchitecture-independent features to predict cycle-level performance for hypothetical processor designs. We demonstrate that this framework enables accurate performance analysis and hardware A/B testing on a massive scale.
arXiv Detail & Related papers (2025-09-26T14:36:06Z)
- MiniCPM4: Ultra-Efficient LLMs on End Devices [124.73631357883228]
MiniCPM4 is a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively.
arXiv Detail & Related papers (2025-06-09T16:16:50Z)
- CARL: Causality-guided Architecture Representation Learning for an Interpretable Performance Predictor [6.014777261874645]
Performance predictors have emerged as a promising method to accelerate the evaluation stage of neural architecture search (NAS). We propose a Causality-guided Architecture Representation Learning (CARL) method that aims to separate critical (causal) and redundant (non-causal) features of architectures for generalizable architecture performance prediction. Experiments on five NAS search spaces demonstrate the state-of-the-art accuracy and superior interpretability of CARL.
arXiv Detail & Related papers (2025-06-04T14:30:55Z)
- Leveraging Neural Graph Compilers in Machine Learning Research for Edge-Cloud Systems [5.241450170761232]
This work presents a comprehensive evaluation of neural network graph compilers across heterogeneous hardware platforms.
Our systematic analysis reveals that graph compilers exhibit performance patterns highly dependent on both neural architecture and batch sizes.
We introduce novel metrics to quantify a compiler's ability to mitigate performance friction as batch size increases.
arXiv Detail & Related papers (2025-04-28T19:02:16Z)
- ZeroLM: Data-Free Transformer Architecture Search for Language Models [54.83882149157548]
Current automated proxy discovery approaches suffer from extended search times, susceptibility to data overfitting, and structural complexity. This paper introduces a novel zero-cost proxy methodology that quantifies model capacity through efficient weight statistics. Our evaluation demonstrates the superiority of this approach, achieving a Spearman's rho of 0.76 and a Kendall's tau of 0.53 on the FlexiBERT benchmark. (A hedged sketch of this style of proxy appears after this list.)
arXiv Detail & Related papers (2025-03-24T13:11:22Z)
- Accelerating TinyML Inference on Microcontrollers through Approximate Kernels [3.566060656925169]
In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on microcontrollers.
Our evaluation on an STM32-Nucleo board and two popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our solutions achieve an average latency reduction of 21%.
arXiv Detail & Related papers (2024-09-25T11:10:33Z)
- Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators [33.18173790144853]
We present an automated approach for generating fast performance models that accurately estimate the latency of Deep Neural Networks (DNNs).
We modeled representative DNN accelerators such as Gemmini, UltraTrail, a Plasticine-derived design, and a parameterizable systolic array.
We evaluate only 154 loop kernel iterations to estimate the performance of 4.19 billion instructions, achieving a significant speedup.
arXiv Detail & Related papers (2024-09-13T07:27:55Z)
- Mechanistic Design and Scaling of Hybrid Architectures [114.3129802943915]
We identify and test new hybrid architectures constructed from a variety of computational primitives.
We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis.
We find MAD (mechanistic architecture design) synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures.
arXiv Detail & Related papers (2024-03-26T16:33:12Z)
- Learning Generalizable Program and Architecture Representations for Performance Modeling [0.3277163122167434]
PerfVec is a novel deep learning-based performance modeling framework.
It learns high-dimensional, mutually independent (orthogonal) program and microarchitecture representations.
PerfVec yields a foundation model that captures the performance essence of instructions.
arXiv Detail & Related papers (2023-10-25T17:24:01Z)
- ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design [52.57999109204569]
ArchGym is an open-source framework that connects diverse search algorithms to architecture simulators.
We evaluate ArchGym across multiple vanilla and domain-specific search algorithms in designing a custom memory controller, deep neural network accelerators, and a custom SoC for AR/VR workloads.
arXiv Detail & Related papers (2023-06-15T06:41:23Z)
- Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks, owing to its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires significant computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
- Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially on resource-limited devices.
Previous unstructured or structured weight pruning methods rarely deliver real inference acceleration.
We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
- STONNE: A Detailed Architectural Simulator for Flexible Neural Network Accelerators [5.326345912766044]
STONNE is a cycle-accurate, highly modular, and highly extensible simulation framework.
We show how it can closely approach the performance results of the publicly available BSV-coded MAERI implementation.
arXiv Detail & Related papers (2020-06-10T19:20:52Z)
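To make the zero-cost proxy idea from the ZeroLM entry above concrete, here is a minimal sketch of scoring candidate architectures from weight statistics alone and checking rank agreement with ground-truth quality via Spearman's rho and Kendall's tau. The particular statistic, the synthetic search space, and the fabricated ground truth are all illustrative assumptions (NumPy and SciPy are assumed available); ZeroLM's actual proxy differs.

```python
# Hypothetical weight-statistics proxy; ZeroLM's actual statistic differs.
import numpy as np
from scipy.stats import kendalltau, spearmanr

rng = np.random.default_rng(0)

def weight_stat_proxy(layers):
    # Score a candidate from its (untrained) weights alone: here, the sum
    # of per-layer weight std scaled by sqrt(fan-in), a capacity-like signal
    # computed without any training data.
    return sum(w.std() * np.sqrt(w.shape[0]) for w in layers)

# Synthetic search space: 20 candidates, each a stack of random matrices.
candidates = [
    [rng.normal(scale=rng.uniform(0.5, 2.0), size=(64, 64))
     for _ in range(rng.integers(2, 6))]
    for _ in range(20)
]
scores = np.array([weight_stat_proxy(c) for c in candidates])

# Fabricated "ground-truth" quality, correlated with the scores for the demo.
true_quality = scores + rng.normal(scale=scores.std(), size=scores.size)

rho, _ = spearmanr(scores, true_quality)
tau, _ = kendalltau(scores, true_quality)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```

Rank correlations, rather than absolute error, are the right yardstick here because a search procedure only needs the proxy to order candidates correctly.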
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.