Understanding Accelerator Compilers via Performance Profiling
- URL: http://arxiv.org/abs/2511.19764v1
- Date: Mon, 24 Nov 2025 22:40:11 GMT
- Title: Understanding Accelerator Compilers via Performance Profiling
- Authors: Ayaka Yorihiro, Griffin Berlstein, Pedro Pontes García, Kevin Laeufer, Adrian Sampson
- Abstract summary: Accelerator design languages (ADLs) are high-level languages that compile to hardware units. We introduce Petal, a cycle-level tool for understanding how the compiler's decisions affect performance. We show that Petal's cycle-level profiles can identify performance problems in existing designs.
- Score: 1.1841612917872066
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Accelerator design languages (ADLs), high-level languages that compile to hardware units, help domain experts quickly design efficient application-specific hardware. ADL compilers optimize datapaths and convert software-like control flow constructs into control paths. Such compilers are necessarily complex and often unpredictable: they must bridge the wide semantic gap between high-level semantics and cycle-level schedules, and they typically rely on advanced heuristics to optimize circuits. The resulting performance can be difficult to control, requiring guesswork to find and resolve performance problems in the generated hardware. We conjecture that ADL compilers will never be perfect: some performance unpredictability is endemic to the problem they solve. In lieu of compiler perfection, we argue for compiler understanding tools that give ADL programmers insight into how the compiler's decisions affect performance. We introduce Petal, a cycle-level profiler for the Calyx intermediate language (IL). Petal instruments the Calyx code with probes and then analyzes the trace from a register-transfer-level simulation. It maps the events in the trace back to high-level control constructs in the Calyx code to track the clock cycles when each construct was active. Using case studies, we demonstrate that Petal's cycle-level profiles can identify performance problems in existing accelerator designs. We show that these insights can also guide developers toward optimizations that the compiler was unable to perform automatically, including a 46.9% reduction in total cycles for one application.
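As a rough illustration of the profiling approach the abstract describes (probe instrumentation plus trace attribution), the sketch below counts, per high-level control construct, the cycles during which its probe was active. The probe names, construct labels, and trace format are hypothetical, invented for this example; they are not Petal's real interface.

```python
from collections import Counter

# Assumed mapping from probe signal names to the Calyx control constructs
# they instrument (illustrative only).
PROBE_TO_CONSTRUCT = {
    "seq_probe": "seq (main)",
    "while_probe": "while (loop body)",
    "par_probe": "par (parallel compute)",
}

def profile(trace):
    """trace: one set of active probe names per clock cycle.
    Returns the number of cycles each control construct was active."""
    active_cycles = Counter()
    for probes in trace:
        for probe in probes:
            if probe in PROBE_TO_CONSTRUCT:
                active_cycles[PROBE_TO_CONSTRUCT[probe]] += 1
    return active_cycles

# A toy 5-cycle trace: the outer seq is active throughout, the while loop
# for three cycles, and the parallel block for one cycle.
trace = [
    {"seq_probe"},
    {"seq_probe", "while_probe"},
    {"seq_probe", "while_probe"},
    {"seq_probe", "while_probe", "par_probe"},
    {"seq_probe"},
]
print(profile(trace)["seq (main)"])  # 5
```

Because constructs nest (a `while` runs inside a `seq`), the same cycle is attributed to every active construct rather than to exactly one, which is what makes the per-construct totals directly comparable against the total cycle count.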
Related papers
- PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations [0.8699280339422538]
We introduce PEAK, a Performance Engineering AI-Assistant for Kernels powered by natural language transformations. We show that our implementations are competitive with vendor libraries when available, and for HLSL (without a library) we document our FLOPS.
arXiv Detail & Related papers (2025-12-22T04:15:24Z)
- XTC, A Research Platform for Optimizing AI Workload Operators [0.0]
We introduce XTC, a platform that unifies scheduling and performance evaluation across compilers. With its common API and reproducible measurement framework, XTC enables portable experimentation and accelerates research on optimization strategies.
arXiv Detail & Related papers (2025-12-18T13:24:44Z)
- Context-Guided Decompilation: A Step Towards Re-executability [50.71992919223209]
Binary decompilation plays an important role in software security analysis, reverse engineering, and malware understanding. Recent advances in large language models (LLMs) have enabled neural decompilation, but the generated code is typically only semantically plausible. We propose ICL4Decomp, a hybrid decompilation framework that leverages in-context learning (ICL) to guide LLMs toward generating re-executable source code.
arXiv Detail & Related papers (2025-11-03T17:21:39Z)
- Fun with flags: How Compilers Break and Fix Constant-Time Code [0.0]
We analyze how compiler optimizations break constant-time code. The key insight is that a small set of passes is at the root of most leaks. We propose an original and practical mitigation that requires no source code modification or custom compiler.
arXiv Detail & Related papers (2025-07-08T15:52:17Z)
- RAG-Based Fuzzing of Cross-Architecture Compilers [0.8302146576157498]
OneAPI is an open standard that supports cross-architecture software development with minimal effort from developers. OneAPI brings DPC++ and C++ compilers, which need to be thoroughly tested to verify their correctness, reliability, and security. This paper proposes a large language model (LLM)-based compiler fuzzing tool that integrates the concept of retrieval-augmented generation (RAG).
arXiv Detail & Related papers (2025-04-11T20:46:52Z)
- Finding Missed Code Size Optimizations in Compilers using LLMs [1.90019787465083]
We develop a novel testing approach which combines large language models with a series of differential testing strategies. Our approach requires fewer than 150 lines of code to implement. To date we have reported 24 confirmed bugs in production compilers.
arXiv Detail & Related papers (2024-12-31T21:47:46Z)
- CompilerDream: Learning a Compiler World Model for General Code Optimization [58.87557583347996]
We introduce CompilerDream, a model-based reinforcement learning approach to general code optimization. It comprises a compiler world model that accurately simulates the intrinsic properties of optimization passes and an agent trained on this model to produce effective optimization strategies. It excels across diverse datasets, surpassing LLVM's built-in optimizations and other state-of-the-art methods in both value prediction and end-to-end code optimization.
arXiv Detail & Related papers (2024-04-24T09:20:33Z)
- Using the Abstract Computer Architecture Description Language to Model AI Hardware Accelerators [77.89070422157178]
Manufacturers of AI-integrated products face a critical challenge: selecting an accelerator that aligns with their product's performance requirements.
The Abstract Computer Architecture Description Language (ACADL) is a concise formalization of computer architecture block diagrams.
In this paper, we demonstrate how to use the ACADL to model AI hardware accelerators, use their ACADL description to map DNNs onto them, and explain the timing simulation semantics to gather performance results.
arXiv Detail & Related papers (2024-01-30T19:27:16Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
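The two-step decomposition this summary describes (a small primitive computational core, plus logical loops specified declaratively around it) can be illustrated in miniature. The names `gemm_tpp` and `run_loops` and the loop-spec format below are invented for this sketch and are not the framework's actual API.

```python
import numpy as np

BLK = 4  # tile size (illustrative)

def gemm_tpp(C, A, B, i, j, k):
    """Step 1 - computational core: an accumulating block matmul on one tile,
    standing in for a Tensor Processing Primitive."""
    C[i:i+BLK, j:j+BLK] += A[i:i+BLK, k:k+BLK] @ B[k:k+BLK, j:j+BLK]

def run_loops(extents, body):
    """Step 2 - logical loops, given declaratively as a list of iteration
    extents; the interpreter generates the nest instead of hand-written loops."""
    def rec(dims, idx):
        if not dims:
            body(*idx)
        else:
            for v in range(0, dims[0], BLK):
                rec(dims[1:], idx + (v,))
    rec(extents, ())

# Usage: an 8x8 matmul tiled into 4x4 blocks.
n = 8
A = np.arange(n * n, dtype=float).reshape(n, n)
B = np.eye(n)
C = np.zeros((n, n))
run_loops([n, n, n], lambda i, j, k: gemm_tpp(C, A, B, i, j, k))
print(np.allclose(C, A @ B))  # True
```

The payoff of the split is that the same declarative loop nest can be reordered or retiled without touching the core primitive, which is the separation of concerns the abstract is pointing at.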
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of over 77,000 pairs of competitive C++ programming submissions, capturing performance-improving edits made by human programmers.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought; for finetuning, we use performance-conditioned generation and synthetic data augmentation based on self-play.
arXiv Detail & Related papers (2023-02-15T18:59:21Z)
- CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research [26.06438868492976]
Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly.
But compiler research has a high entry barrier.
We introduce CompilerGym, a set of environments for real-world compiler optimization tasks.
We also introduce a toolkit for exposing new optimization tasks to compiler researchers.
arXiv Detail & Related papers (2021-09-17T01:02:27Z)
- StreamBlocks: A compiler for heterogeneous dataflow computing (technical report) [1.5293427903448022]
This work introduces StreamBlocks, an open-source compiler and runtime that uses the CAL dataflow programming language to partition computations across platforms.
StreamBlocks supports exploring the design space with a profile-guided tool that helps identify the best hardware-software partitions.
arXiv Detail & Related papers (2021-07-20T08:46:47Z)
- Extending C++ for Heterogeneous Quantum-Classical Computing [56.782064931823015]
qcor is a language extension to C++ and compiler implementation that enables heterogeneous quantum-classical programming, compilation, and execution in a single-source context.
Our work provides a first-of-its-kind C++ compiler enabling high-level quantum kernel (function) expression in a quantum-language-agnostic manner.
arXiv Detail & Related papers (2020-10-08T12:49:07Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.