CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
- URL: http://arxiv.org/abs/2505.16968v3
- Date: Thu, 29 May 2025 05:44:32 GMT
- Title: CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
- Authors: Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud,
- Abstract summary: We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation. The dataset comprises 70k verified code pairs across host and device. We train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy.
- Score: 8.97422045170539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA <--> HIP) and assembly-level (Nvidia SASS <--> AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation.
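To make the source-level direction concrete, here is a minimal sketch of the kind of CUDA-to-HIP API renaming that source translation involves. The mapping entries are real CUDA/HIP identifiers, but the `transpile_cuda_to_hip` helper and the example snippet are purely illustrative and are not part of the CASS release.

```python
import re

# Illustrative subset of the CUDA -> HIP runtime-API mapping that
# hipify-style source translators perform.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def transpile_cuda_to_hip(source: str) -> str:
    """Naive token rename for illustration; real tools also handle
    parsing, kernel launch syntax, and library calls."""
    keys = sorted(CUDA_TO_HIP, key=len, reverse=True)  # longest match first
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)

cuda_snippet = """\
#include <cuda_runtime.h>
float *d_x;
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
cudaFree(d_x);
"""

print(transpile_cuda_to_hip(cuda_snippet))
```

The assembly-level direction (SASS <--> RDNA3) has no such one-to-one textual mapping, which is part of why the paper trains dedicated models for that setting.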
Related papers
- GPU-Accelerated Interpretable Generalization for Rapid Cyberattack Detection and Forensics [0.0]
The IG mechanism, recently published in IEEE Transactions on Information Forensics and Security, delivers state-of-the-art, evidence-based intrusion detection. We present IG-GPU, a PyTorch re-architecture that offloads all pairwise intersections and subset evaluations to a commodity GPU. On the 15k-record NSL-KDD dataset, IG-GPU shows a 116-fold speed-up over the multi-core CPU implementation of IG (a minimal sketch of GPU-offloaded pairwise intersections appears after this list).
arXiv Detail & Related papers (2025-07-16T12:38:19Z) - NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search in out-of-domain scenarios while avoiding the significant slowdown caused by beam search (a toy sketch of n-gram context-biasing in greedy decoding appears after this list).
arXiv Detail & Related papers (2025-05-28T20:43:10Z) - Can Large Language Models Predict Parallel Code Performance? [1.5221392705893568]
This paper explores whether Large Language Models (LLMs) can offer an alternative approach to GPU performance prediction without relying on hardware. LLMs have a strong understanding of the Roofline model, achieving 100% classification accuracy when provided with explicit profiling data. Our findings suggest that with better datasets and prompt strategies, LLMs could become practical tools for HPC roofline analysis and performance portability (the Roofline formula is restated after this list).
arXiv Detail & Related papers (2025-05-06T21:41:20Z) - LLM-Driven Multi-step Translation from C to Rust using Static Analysis [27.122409727034192]
Translating software written in legacy languages to modern languages, such as C to Rust, has significant benefits for memory safety. We propose SACTOR, an LLM-driven C-to-Rust zero-shot translation tool using a two-step translation methodology. SACTOR produces more natural and Rust-compliant translations than existing methods.
arXiv Detail & Related papers (2025-03-16T14:05:26Z) - Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation [1.64043572114825]
We introduce a novel method to enhance cross-language code translation from Fortran to C++ by integrating task-specific embedding alignment. Our strategy aligns the retrieval model directly with the objective of maximizing translation quality, as quantified by the CodeBLEU metric. By integrating these CodeBLEU-optimized embeddings into the RAG framework, our approach significantly enhances both retrieval accuracy and code generation quality.
arXiv Detail & Related papers (2024-12-06T16:22:32Z) - KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture (a short autodiff sketch of nth-order gradients appears after this list).
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++ [7.872005563259838]
The effectiveness of our dataset is assessed using both quantitative (CodeBLEU) and qualitative (human evaluation) methods.
Models without prior coding knowledge experienced a boost of $\mathbf{\times 5.1}$ in CodeBLEU scores.
Models with some coding familiarity saw an impressive $\mathbf{\times 9.9}$-fold increase.
arXiv Detail & Related papers (2023-07-15T02:35:51Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - HDCC: A Hyperdimensional Computing compiler for classification on embedded systems and high-performance computing [58.720142291102135]
This work introduces HDCC, the first open-source compiler that translates high-level descriptions of HDC classification methods into optimized C code.
HDCC is designed like a modern compiler, featuring an intuitive and descriptive input language, an intermediate representation (IR), and a retargetable backend.
To substantiate these claims, we conducted experiments with HDCC on several of the most popular datasets in the HDC literature (the core HDC operations such a compiler targets are sketched after this list).
arXiv Detail & Related papers (2023-04-24T19:16:03Z) - Cramming: Training a Language Model on a Single GPU in One Day [64.18297923419627]
Recent trends in language modeling have focused on increasing performance through scaling.
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU.
We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings.
arXiv Detail & Related papers (2022-12-28T18:59:28Z) - GEVO: GPU Code Optimization using Evolutionary Computation [12.9965710635562]
GEVO is a tool for discovering optimization opportunities and tuning the performance of GPU kernels in the LLVM representation.
GEVO improves the execution time of GPU programs from the Rodinia benchmark suite and of the machine learning models SVM and ResNet18 on an NVIDIA Tesla P100.
GEVO achieves a 1.79X kernel performance improvement on image classification using ResNet18/CIFAR-10, with less than 1% reduction in model accuracy.
arXiv Detail & Related papers (2020-04-17T09:36:17Z)
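Regarding the GPU-accelerated interpretable-generalization (IG-GPU) entry above: a minimal sketch, assuming records are encoded as binary feature vectors, of how all pairwise intersection sizes can be offloaded to a GPU as a single matrix multiplication (an illustration of the offload idea, not the IG-GPU implementation).

```python
import torch

# Toy stand-in for a record/feature matrix; the NSL-KDD split cited above has ~15k records.
n_records, n_features = 2_048, 122
records = (torch.rand(n_records, n_features) > 0.5)

def pairwise_intersection_counts(x: torch.Tensor) -> torch.Tensor:
    """|r_i AND r_j| for every pair of binary rows, via one matmul."""
    f = x.float()
    return f @ f.T  # (n, n) matrix of pairwise intersection counts

device = "cuda" if torch.cuda.is_available() else "cpu"
counts = pairwise_intersection_counts(records.to(device))
print(counts.shape)  # torch.Size([2048, 2048])
```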
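For the NGPU-LM entry, here is a toy sketch of n-gram context-biasing during greedy decoding: an additive bonus is applied to tokens favored by the recent context before each argmax. The GPU-friendly data structures and batching that make this fast are that paper's contribution and are not reproduced here.

```python
import torch

def greedy_decode_with_bias(logits: torch.Tensor,
                            ngram_bonus: dict[tuple[int, ...], dict[int, float]],
                            context_len: int = 2,
                            weight: float = 1.0) -> list[int]:
    """Greedy decoding over (T, V) logits with an additive n-gram bonus.

    `ngram_bonus` maps a tuple of previous token ids to per-token score
    boosts; it is a toy stand-in for a real n-gram language model.
    """
    out: list[int] = []
    for t in range(logits.shape[0]):
        scores = logits[t].clone()
        ctx = tuple(out[-context_len:])
        for tok, bonus in ngram_bonus.get(ctx, {}).items():
            scores[tok] += weight * bonus
        out.append(int(scores.argmax()))
    return out

# Toy usage: boost token 7 whenever the two previous tokens were (3, 5).
toy_logits = torch.randn(6, 32)
print(greedy_decode_with_bias(toy_logits, {(3, 5): {7: 4.0}}))
```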
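For reference, the Roofline model invoked in the "Can Large Language Models Predict Parallel Code Performance?" entry reduces the compute- versus memory-bound classification to one formula, where $I$ is the arithmetic intensity (FLOPs per byte moved), $P_{\text{peak}}$ the peak compute throughput, and $B_{\text{peak}}$ the peak memory bandwidth:

$$P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\ I \cdot B_{\text{peak}}\bigr)$$

A kernel is memory-bound when $I < P_{\text{peak}} / B_{\text{peak}}$ and compute-bound otherwise.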
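The nth-order gradients targeted by the INR-Arch entry can be expressed in any autodiff framework by differentiating repeatedly; the PyTorch sketch below (purely illustrative, unrelated to that paper's dataflow hardware) shows why naive repeated differentiation is costly: the graph grows with every extra order.

```python
import torch

def nth_order_grad(f, x: torch.Tensor, n: int) -> torch.Tensor:
    """d^n f / dx^n at x, computed by differentiating the graph n times."""
    x = x.clone().requires_grad_(True)
    y = f(x)
    for _ in range(n):
        # create_graph=True keeps each gradient differentiable, so the
        # autograd graph (and the cost) grows with every additional order.
        (y,) = torch.autograd.grad(y, x, create_graph=True)
    return y

# Example: the 3rd derivative of sin(x) is -cos(x).
x0 = torch.tensor(0.3)
print(nth_order_grad(torch.sin, x0, 3), -torch.cos(x0))
```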
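To ground the HDCC entry, the snippet below sketches the core hyperdimensional-computing operations such a compiler lowers to optimized code: binding (XOR), bundling (bit-wise majority), and Hamming-similarity lookup. The dimensionality and the toy two-class setup are illustrative choices, not HDCC's.

```python
import numpy as np

D = 10_000  # hypervector dimensionality (a typical HDC choice)
rng = np.random.default_rng(0)

def random_hv() -> np.ndarray:
    return rng.integers(0, 2, size=D, dtype=np.uint8)

def bind(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return a ^ b  # XOR binding of two binary hypervectors

def bundle(hvs: list[np.ndarray]) -> np.ndarray:
    votes = np.sum(np.stack(hvs), axis=0, dtype=np.int32)
    return (votes * 2 > len(hvs)).astype(np.uint8)  # bit-wise majority

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - np.count_nonzero(a ^ b) / D

# Toy classification: class prototypes are bundles of bound feature pairs;
# a query is assigned to the most similar prototype.
f = [random_hv() for _ in range(8)]
class_a = bundle([bind(f[0], f[1]), bind(f[2], f[3])])
class_b = bundle([bind(f[4], f[5]), bind(f[6], f[7])])
query = bind(f[0], f[1])
print("class A" if hamming_similarity(query, class_a) > hamming_similarity(query, class_b) else "class B")
```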
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.