VeriLocc: End-to-End Cross-Architecture Register Allocation via LLM
- URL: http://arxiv.org/abs/2506.17506v1
- Date: Fri, 20 Jun 2025 23:08:09 GMT
- Title: VeriLocc: End-to-End Cross-Architecture Register Allocation via LLM
- Authors: Lesheng Jin, Zhenyuan Ruan, Haohui Mai, Jingbo Shang
- Abstract summary: We introduce VeriLocc, a framework that combines large language models (LLMs) with formal compiler techniques to enable generalizable and verifiable register allocation across GPU architectures. Evaluated on matrix multiplication (GEMM) and multi-head attention (MHA), VeriLocc achieves 85-99% single-shot accuracy and near-100% pass@100. A case study shows that VeriLocc discovers more performant assignments than expert-tuned libraries, outperforming rocBLAS by over 10% in runtime.
- Score: 39.27052626057448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern GPUs evolve rapidly, yet production compilers still rely on hand-crafted register allocation heuristics that require substantial re-tuning for each hardware generation. We introduce VeriLocc, a framework that combines large language models (LLMs) with formal compiler techniques to enable generalizable and verifiable register allocation across GPU architectures. VeriLocc fine-tunes an LLM to translate machine intermediate representations (MIRs) into target-specific register assignments, aided by static analysis for cross-architecture normalization and generalization, and by a verifier-guided regeneration loop that ensures correctness. Evaluated on matrix multiplication (GEMM) and multi-head attention (MHA), VeriLocc achieves 85-99% single-shot accuracy and near-100% pass@100. A case study shows that VeriLocc discovers more performant assignments than expert-tuned libraries, outperforming rocBLAS by over 10% in runtime.
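The verifier-guided regeneration loop described above can be pictured as a simple sample-check-resample cycle. The sketch below is a minimal illustration under assumed interfaces, not VeriLocc's implementation: `llm_propose_allocation` and `verify_allocation` are hypothetical stand-ins for the fine-tuned LLM and the formal verifier.

```python
import random

def llm_propose_allocation(mir, feedback=None):
    """Hypothetical stand-in for the fine-tuned LLM: maps a MIR function
    to a register assignment; a real model would condition on verifier
    feedback from a failed attempt."""
    phys = [f"p{i}" for i in range(4)]  # toy physical register file
    return {vreg: random.choice(phys) for vreg in mir["vregs"]}

def verify_allocation(mir, alloc):
    """Hypothetical verifier: here it only checks that simultaneously
    live values never share a physical register."""
    for a, b in mir["interferences"]:
        if alloc[a] == alloc[b]:
            return False, f"{a} and {b} interfere but both got {alloc[a]}"
    return True, None

def allocate_with_regeneration(mir, max_attempts=100):
    """Sample-check-resample: regenerate until the verifier accepts."""
    feedback = None
    for attempt in range(max_attempts):
        alloc = llm_propose_allocation(mir, feedback)
        ok, feedback = verify_allocation(mir, alloc)
        if ok:
            return alloc, attempt + 1
    raise RuntimeError("no verified allocation within the attempt budget")

mir = {"vregs": ["v0", "v1", "v2"], "interferences": [("v0", "v1")]}
alloc, tries = allocate_with_regeneration(mir)
print(alloc, "accepted after", tries, "attempt(s)")
```

In this framing, the reported pass@100 is the probability that at least one of 100 sampled allocations passes the verifier.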
Related papers
- CUDA-LLM: LLMs Can Write Efficient CUDA Kernels [9.287036563375617]
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. We propose a novel framework called Feature Search Reinforcement (FSR), which jointly optimizes compilation and functional correctness.
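The phrase "jointly optimizes compilation and functional correctness" suggests a fitness signal that combines both checks. Below is a minimal sketch of such a joint score, assuming `nvcc` is on PATH and a caller-supplied test driver; the function and scoring scheme are illustrative guesses, not the paper's API.

```python
import os
import subprocess
import tempfile

def joint_score(kernel_src: str, test_main: str) -> float:
    """Illustrative joint fitness for a candidate CUDA kernel:
    0.0 if it fails to compile, 0.5 if it compiles but fails the
    functional test, 1.0 if both succeed."""
    with tempfile.TemporaryDirectory() as d:
        src, exe = os.path.join(d, "cand.cu"), os.path.join(d, "cand")
        with open(src, "w") as f:
            f.write(kernel_src + "\n" + test_main)  # kernel + test driver
        build = subprocess.run(["nvcc", src, "-o", exe], capture_output=True)
        if build.returncode != 0:
            return 0.0  # compilation failed
        run = subprocess.run([exe], capture_output=True)
        return 1.0 if run.returncode == 0 else 0.5  # test pass or fail
```

A search loop would then keep high-scoring candidates and mutate or regenerate the rest.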
arXiv Detail & Related papers (2025-06-10T10:51:03Z)
- AEQUAM: Accelerating Quantum Algorithm Validation through FPGA-Based Emulation [0.46873264197900916]
AEQUAM is a toolchain that enables faster and more accessible quantum circuit verification. It consists of a compiler that translates OpenQASM 2.0 into RISC-like instructions, Cython software models for selecting number representations and simulating circuits, and a VHDL generator that produces RTL descriptions for FPGA-based hardware emulators.
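As a rough picture of the compiler stage, the sketch below lowers a few OpenQASM 2.0 statements into RISC-like tuples. The target instruction set here is invented for illustration; AEQUAM's actual ISA is not specified in the summary.

```python
import re

def lower_openqasm(src: str):
    """Translate a tiny OpenQASM 2.0 subset into hypothetical RISC-like
    instructions of the form (opcode, qubit, extra operand)."""
    ops = []
    for line in src.splitlines():
        line = line.strip().rstrip(";")
        if m := re.fullmatch(r"h q\[(\d+)\]", line):
            ops.append(("APPLY_H", int(m.group(1)), None))
        elif m := re.fullmatch(r"cx q\[(\d+)\],\s*q\[(\d+)\]", line):
            ops.append(("APPLY_CX", int(m.group(1)), int(m.group(2))))
        elif m := re.fullmatch(r"measure q\[(\d+)\] -> c\[(\d+)\]", line):
            ops.append(("MEASURE", int(m.group(1)), int(m.group(2))))
    return ops

print(lower_openqasm("h q[0];\ncx q[0], q[1];\nmeasure q[0] -> c[0];"))
# [('APPLY_H', 0, None), ('APPLY_CX', 0, 1), ('MEASURE', 0, 0)]
```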
arXiv Detail & Related papers (2025-06-01T14:17:23Z)
- NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search in out-of-domain scenarios while avoiding the significant slowdown caused by beam search.
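One way to make n-gram lookups batch-friendly for GPU-style execution is a flat, fixed-size table queried with vectorized gathers. The NumPy sketch below is a guess at that general idea, not NGPU-LM's actual data structure; collisions and backoff are handled naively.

```python
import numpy as np

VOCAB, TABLE = 1000, 1 << 16          # toy vocabulary and table sizes
BACKOFF = np.float32(-10.0)           # score for unseen n-grams

# Flat parallel arrays instead of nested dicts, so a whole batch of
# queries becomes a few vectorized gathers.
keys = np.full(TABLE, -1, dtype=np.int64)
logp = np.full(TABLE, BACKOFF, dtype=np.float32)

def slot(ctx, tok):
    return (ctx * np.int64(VOCAB) + tok) % TABLE   # naive hash

def insert(ctx, tok, score):
    s = slot(ctx, tok)
    keys[s], logp[s] = ctx * VOCAB + tok, score

def batch_score(ctx, tok):
    """Score many (context, token) pairs in one vectorized pass."""
    s = slot(ctx, tok)
    hit = keys[s] == ctx * np.int64(VOCAB) + tok
    return np.where(hit, logp[s], BACKOFF)

insert(5, 42, -1.5)
print(batch_score(np.array([5, 7]), np.array([42, 9])))  # [-1.5, -10.0]
```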
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
- VecTrans: Enhancing Compiler Auto-Vectorization through LLM-Assisted Code Transformations [17.974013479973774]
VecTrans is a framework that leverages large language models to enhance compiler-based code vectorization. VecTrans achieves a geomean speedup of 1.77x and successfully vectorizes 24 of 51 test cases.
arXiv Detail & Related papers (2025-03-25T08:39:35Z)
- LLM-Vectorizer: LLM-based Verified Loop Vectorizer [12.048697450464935]
Large language models (LLMs) can generate vectorized code from scalar programs that process individual array elements.
LLMs are capable of producing high performance vectorized code with run-time speedup ranging from 1.1x to 9.4x.
Our approach is able to verify 38.2% of vectorizations as correct on the TSVC benchmark dataset.
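The verification step checks that a vectorized rewrite matches the scalar original; the sketch below substitutes randomized differential testing for the paper's formal verification, on a SAXPY-style loop, and the function names are illustrative.

```python
import numpy as np

def scalar_saxpy(a, x, y):
    """Reference scalar loop, one element at a time."""
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def vectorized_saxpy(a, x, y):
    """Candidate vectorized rewrite whose equivalence we want to check."""
    return a * x + y

def differential_check(f_ref, f_vec, trials=100):
    """Compare the two implementations on random inputs."""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        n = int(rng.integers(1, 64))
        a = float(rng.normal())
        x, y = rng.normal(size=n), rng.normal(size=n)
        if not np.allclose(f_ref(a, x, y), f_vec(a, x, y)):
            return False
    return True

print(differential_check(scalar_saxpy, vectorized_saxpy))  # True
```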
arXiv Detail & Related papers (2024-06-07T07:04:26Z)
- Universal In-Context Approximation By Prompting Fully Recurrent Models [86.61942787684272]
We show that RNNs, LSTMs, GRUs, Linear RNNs, and linear gated architectures can serve as universal in-context approximators.
We introduce a programming language called LSRL that compiles to fully recurrent architectures.
arXiv Detail & Related papers (2024-06-03T15:25:13Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)
- Enabling Retargetable Optimizing Compilers for Quantum Accelerators via a Multi-Level Intermediate Representation [78.8942067357231]
We present a multi-level quantum-classical intermediate representation (IR) that enables an optimizing, retargetable, ahead-of-time compiler.
We support the entire gate-based OpenQASM 3 language and provide custom extensions for common quantum programming patterns and improved syntax.
Our work results in compile times that are 1000x faster than standard Pythonic approaches, and 5-10x faster than comparative standalone quantum language compilers.
arXiv Detail & Related papers (2021-09-01T17:29:47Z)
- A Case Study of LLVM-Based Analysis for Optimizing SIMD Code Generation [0.0]
This paper presents a methodology for using LLVM-based tools to tune the DCA++ application that targets the new ARM A64FX processor.
By applying these code changes, code speed increased by 1.98x and 78 GFlops were achieved on the A64FX processor.
arXiv Detail & Related papers (2021-06-27T22:38:16Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL Primitives [55.79741270235602]
We present compiler algorithms to automatically generate high-performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
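Loop tiling for data reuse is the kind of schedule a polyhedral analysis reasons about. The sketch below is a didactic tiled matrix-multiplication loop nest in plain Python, not PolyDL's generated code.

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    """Tiled loop nest: tile x tile blocks of A, B, and C are reused
    while hot in cache, amortizing memory traffic over many accesses."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # NumPy slicing clips ragged edge tiles automatically.
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

A, B = np.random.rand(64, 96), np.random.rand(96, 80)
assert np.allclose(matmul_tiled(A, B), A @ B)
```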
arXiv Detail & Related papers (2020-06-02T06:44:09Z)