VeriLocc: End-to-End Cross-Architecture Register Allocation via LLM
- URL: http://arxiv.org/abs/2506.17506v1
- Date: Fri, 20 Jun 2025 23:08:09 GMT
- Title: VeriLocc: End-to-End Cross-Architecture Register Allocation via LLM
- Authors: Lesheng Jin, Zhenyuan Ruan, Haohui Mai, Jingbo Shang
- Abstract summary: We introduce VeriLocc, a framework that combines large language models (LLMs) with formal compiler techniques to enable generalizable and verifiable register allocation across GPU architectures. Evaluated on matrix multiplication (GEMM) and multi-head attention (MHA), VeriLocc achieves 85-99% single-shot accuracy and near-100% pass@100. A case study shows that VeriLocc discovers more performant assignments than expert-tuned libraries, outperforming rocBLAS by over 10% in runtime.
- Score: 39.27052626057448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern GPUs evolve rapidly, yet production compilers still rely on hand-crafted register allocation heuristics that require substantial re-tuning for each hardware generation. We introduce VeriLocc, a framework that combines large language models (LLMs) with formal compiler techniques to enable generalizable and verifiable register allocation across GPU architectures. VeriLocc fine-tunes an LLM to translate machine intermediate representations (MIRs) into target-specific register assignments, aided by static analysis for cross-architecture normalization and generalization, and by a verifier-guided regeneration loop that ensures correctness. Evaluated on matrix multiplication (GEMM) and multi-head attention (MHA), VeriLocc achieves 85-99% single-shot accuracy and near-100% pass@100. A case study shows that VeriLocc discovers more performant assignments than expert-tuned libraries, outperforming rocBLAS by over 10% in runtime.
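The verifier-guided regeneration loop described above can be pictured as a simple sample-check-resample cycle. The sketch below is a minimal illustration under assumed interfaces, not VeriLocc's implementation: `llm_propose_allocation` and `verify_allocation` are hypothetical stand-ins for the fine-tuned LLM and the formal verifier.

```python
import random

def llm_propose_allocation(mir, feedback=None):
    """Hypothetical stand-in for the fine-tuned LLM: maps a MIR function
    to a register assignment; a real model would condition on verifier
    feedback from a failed attempt."""
    phys = [f"p{i}" for i in range(4)]  # toy physical register file
    return {vreg: random.choice(phys) for vreg in mir["vregs"]}

def verify_allocation(mir, alloc):
    """Hypothetical verifier: here it only checks that simultaneously
    live values never share a physical register."""
    for a, b in mir["interferences"]:
        if alloc[a] == alloc[b]:
            return False, f"{a} and {b} interfere but both got {alloc[a]}"
    return True, None

def allocate_with_regeneration(mir, max_attempts=100):
    """Sample-check-resample: regenerate until the verifier accepts."""
    feedback = None
    for attempt in range(max_attempts):
        alloc = llm_propose_allocation(mir, feedback)
        ok, feedback = verify_allocation(mir, alloc)
        if ok:
            return alloc, attempt + 1
    raise RuntimeError("no verified allocation within the attempt budget")

mir = {"vregs": ["v0", "v1", "v2"], "interferences": [("v0", "v1")]}
alloc, tries = allocate_with_regeneration(mir)
print(alloc, "accepted after", tries, "attempt(s)")
```

In this framing, the reported pass@100 is the probability that at least one of 100 sampled allocations passes the verifier.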
Related papers
- CUDA-LLM: LLMs Can Write Efficient CUDA Kernels [9.287036563375617]
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. We propose a novel framework called Feature Search Reinforcement (FSR), which jointly optimizes compilation and functional correctness.
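The phrase "jointly optimizes compilation and functional correctness" suggests a fitness signal that combines both checks. Below is a minimal sketch of such a joint score, assuming `nvcc` is on PATH and a caller-supplied test driver; the function and scoring scheme are illustrative guesses, not the paper's API.

```python
import os
import subprocess
import tempfile

def joint_score(kernel_src: str, test_main: str) -> float:
    """Illustrative joint fitness for a candidate CUDA kernel:
    0.0 if it fails to compile, 0.5 if it compiles but fails the
    functional test, 1.0 if both succeed."""
    with tempfile.TemporaryDirectory() as d:
        src, exe = os.path.join(d, "cand.cu"), os.path.join(d, "cand")
        with open(src, "w") as f:
            f.write(kernel_src + "\n" + test_main)  # kernel + test driver
        build = subprocess.run(["nvcc", src, "-o", exe], capture_output=True)
        if build.returncode != 0:
            return 0.0  # compilation failed
        run = subprocess.run([exe], capture_output=True)
        return 1.0 if run.returncode == 0 else 0.5  # test pass or fail
```

A search loop would then keep high-scoring candidates and mutate or regenerate the rest.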
arXiv Detail & Related papers (2025-06-10T10:51:03Z)
- AEQUAM: Accelerating Quantum Algorithm Validation through FPGA-Based Emulation [0.46873264197900916]
AEQUAM is a toolchain that enables faster and more accessible quantum circuit verification. It consists of a compiler that translates OpenQASM 2.0 into RISC-like instructions, Cython software models for selecting number representations and simulating circuits, and a VHDL generator that produces RTL descriptions for FPGA-based hardware emulators.
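As a rough picture of the compiler stage, the sketch below lowers a few OpenQASM 2.0 statements into RISC-like tuples. The target instruction set here is invented for illustration; AEQUAM's actual ISA is not specified in the summary.

```python
import re

def lower_openqasm(src: str):
    """Translate a tiny OpenQASM 2.0 subset into hypothetical RISC-like
    instructions of the form (opcode, qubit, extra operand)."""
    ops = []
    for line in src.splitlines():
        line = line.strip().rstrip(";")
        if m := re.fullmatch(r"h q\[(\d+)\]", line):
            ops.append(("APPLY_H", int(m.group(1)), None))
        elif m := re.fullmatch(r"cx q\[(\d+)\],\s*q\[(\d+)\]", line):
            ops.append(("APPLY_CX", int(m.group(1)), int(m.group(2))))
        elif m := re.fullmatch(r"measure q\[(\d+)\] -> c\[(\d+)\]", line):
            ops.append(("MEASURE", int(m.group(1)), int(m.group(2))))
    return ops

print(lower_openqasm("h q[0];\ncx q[0], q[1];\nmeasure q[0] -> c[0];"))
# [('APPLY_H', 0, None), ('APPLY_CX', 0, 1), ('MEASURE', 0, 0)]
```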
arXiv Detail & Related papers (2025-06-01T14:17:23Z)
- NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search in out-of-domain scenarios while avoiding the significant slowdown caused by beam search.
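One way to make n-gram lookups batch-friendly for GPU-style execution is a flat, fixed-size table queried with vectorized gathers. The NumPy sketch below is a guess at that general idea, not NGPU-LM's actual data structure; collisions and backoff are handled naively.

```python
import numpy as np

VOCAB, TABLE = 1000, 1 << 16          # toy vocabulary and table sizes
BACKOFF = np.float32(-10.0)           # score for unseen n-grams

# Flat parallel arrays instead of nested dicts, so a whole batch of
# queries becomes a few vectorized gathers.
keys = np.full(TABLE, -1, dtype=np.int64)
logp = np.full(TABLE, BACKOFF, dtype=np.float32)

def slot(ctx, tok):
    return (ctx * np.int64(VOCAB) + tok) % TABLE   # naive hash

def insert(ctx, tok, score):
    s = slot(ctx, tok)
    keys[s], logp[s] = ctx * VOCAB + tok, score

def batch_score(ctx, tok):
    """Score many (context, token) pairs in one vectorized pass."""
    s = slot(ctx, tok)
    hit = keys[s] == ctx * np.int64(VOCAB) + tok
    return np.where(hit, logp[s], BACKOFF)

insert(5, 42, -1.5)
print(batch_score(np.array([5, 7]), np.array([42, 9])))  # [-1.5, -10.0]
```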
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
- VecTrans: Enhancing Compiler Auto-Vectorization through LLM-Assisted Code Transformations [17.974013479973774]
VecTrans is a framework that leverages large language models to enhance compiler-based code vectorization. VecTrans achieves a geomean speedup of 1.77x and successfully vectorizes 24 of 51 test cases.
arXiv Detail & Related papers (2025-03-25T08:39:35Z)
- LLM-Vectorizer: LLM-based Verified Loop Vectorizer [12.048697450464935]
Large language models (LLMs) can generate vectorized code from scalar programs that process individual array elements.
LLMs are capable of producing high performance vectorized code with run-time speedup ranging from 1.1x to 9.4x.
Our approach is able to verify 38.2% of vectorizations as correct on the TSVC benchmark dataset.
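The verification step checks that a vectorized rewrite matches the scalar original; the sketch below substitutes randomized differential testing for the paper's formal verification, on a SAXPY-style loop, and the function names are illustrative.

```python
import numpy as np

def scalar_saxpy(a, x, y):
    """Reference scalar loop, one element at a time."""
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def vectorized_saxpy(a, x, y):
    """Candidate vectorized rewrite whose equivalence we want to check."""
    return a * x + y

def differential_check(f_ref, f_vec, trials=100):
    """Compare the two implementations on random inputs."""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        n = int(rng.integers(1, 64))
        a = float(rng.normal())
        x, y = rng.normal(size=n), rng.normal(size=n)
        if not np.allclose(f_ref(a, x, y), f_vec(a, x, y)):
            return False
    return True

print(differential_check(scalar_saxpy, vectorized_saxpy))  # True
```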
arXiv Detail & Related papers (2024-06-07T07:04:26Z)
- Universal In-Context Approximation By Prompting Fully Recurrent Models [86.61942787684272]
We show that RNNs, LSTMs, GRUs, Linear RNNs, and linear gated architectures can serve as universal in-context approximators.
We introduce a programming language called LSRL that compiles to fully recurrent architectures.
arXiv Detail & Related papers (2024-06-03T15:25:13Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)
- Enabling Retargetable Optimizing Compilers for Quantum Accelerators via a Multi-Level Intermediate Representation [78.8942067357231]
We present a multi-level quantum-classical intermediate representation (IR) that enables an optimizing, retargetable, ahead-of-time compiler.
We support the entire gate-based OpenQASM 3 language and provide custom extensions for common quantum programming patterns and improved syntax.
Our work results in compile times that are 1000x faster than standard Pythonic approaches, and 5-10x faster than comparative standalone quantum language compilers.
arXiv Detail & Related papers (2021-09-01T17:29:47Z)
- A Case Study of LLVM-Based Analysis for Optimizing SIMD Code Generation [0.0]
This paper presents a methodology for using LLVM-based tools to tune the DCA++ application that targets the new ARM A64FX processor.
By applying these code changes, code speed increased by 1.98x and 78 GFlops were achieved on the A64FX processor.
arXiv Detail & Related papers (2021-06-27T22:38:16Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL Primitives [55.79741270235602]
We present compiler algorithms to automatically generate high-performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
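Loop tiling for data reuse is the kind of schedule a polyhedral analysis reasons about. The sketch below is a didactic tiled matrix-multiplication loop nest in plain Python, not PolyDL's generated code.

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    """Tiled loop nest: tile x tile blocks of A, B, and C are reused
    while hot in cache, amortizing memory traffic over many accesses."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # NumPy slicing clips ragged edge tiles automatically.
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

A, B = np.random.rand(64, 96), np.random.rand(96, 80)
assert np.allclose(matmul_tiled(A, B), A @ B)
```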
arXiv Detail & Related papers (2020-06-02T06:44:09Z)