ACCeLLiuM: Supervised Fine-Tuning for Automated OpenACC Pragma Generation
- URL: http://arxiv.org/abs/2509.20380v2
- Date: Fri, 26 Sep 2025 01:37:37 GMT
- Title: ACCeLLiuM: Supervised Fine-Tuning for Automated OpenACC Pragma Generation
- Authors: Samyak Jhaveri, Vanessa Klotzmann, Crista Lopes
- Abstract summary: We introduce ACCeLLiuM, two open weights Large Language Models specifically fine-tuned for generating expert OpenACC directives for data-parallel loops. The ACCeLLiuM SFT dataset contains 4,033 OpenACC pragma-loop pairs mined from public GitHub C/C++ repositories, with 3,223 pairs for training and 810 for testing.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The increasing ubiquity of GPUs is accompanied by the increasing complexity of their hardware and parallel programming frameworks. Directive-based parallel programming standards like OpenACC simplify GPU programming to some extent by abstracting away low-level complexities, but a fair amount of expertise is still required in order to use those directives effectively. We introduce ACCeLLiuM, two open weights Large Language Models specifically fine-tuned for generating expert OpenACC directives for data-parallel loops, along with the supervised fine-tuning dataset that was used to train them. The ACCeLLiuM SFT dataset contains 4,033 OpenACC pragma-loop pairs mined from public GitHub C/C++ repositories, with 3,223 pairs for training and 810 for testing. Experimental evaluations show a pronounced performance gap in generating correct OpenACC pragmas between base LLMs and our fine-tuned versions. On the held-out test set, base LLMs fail to consistently generate valid pragmas, whereas LLMs fine-tuned on the ACCeLLiuM dataset generate valid pragmas with the correct directive type for 87% of the data-parallel loops, and exact pragmas, including directives, clauses, clause order, and clause variables, for 50% of the cases. Even when not exact, generated pragmas frequently incorporate the correct clauses in a different order than the ground-truth label, or include additional clauses that enable finer control over parallel execution, data movement, and concurrency, offering practical value beyond strict string-matching. By publicly releasing the code, models, and dataset as ACCeLLiuM we hope to establish a reproducible benchmark for LLM-powered OpenACC pragma generation, and lower the barrier to automated GPU offloading of serially written programs.
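To make the dataset's format concrete, below is a minimal sketch of an OpenACC pragma-loop pair of the kind described above. The saxpy kernel and its clauses are illustrative assumptions on our part, not an entry mined from the dataset; the model's task is to generate the pragma line given the serial loop and its context.

```c
/*
 * Illustrative sketch only (not an ACCeLLiuM dataset entry):
 * an OpenACC pragma-loop pair. The fine-tuned model receives the
 * serial loop and must emit the pragma line above it.
 */
#include <stddef.h>

void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* Ground-truth-style label: the directive type ("parallel loop"),
     * the clauses, their order, and their variables must all match
     * for the strict exact-match metric. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Under this scheme, a generated line such as `#pragma acc parallel loop copy(y[0:n]) copyin(x[0:n])` would count as a valid pragma with the correct directive type yet fail the exact-match criterion on clause order alone, despite being semantically equivalent, which illustrates why the abstract notes practical value beyond strict string matching.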
Related papers
- Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? [48.59679063480356]
Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs converge to left-to-right, autoregressive (AR)-like decoding dynamics. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding.
arXiv Detail & Related papers (2026-02-26T17:04:57Z)
- Bridging the Gap: Empowering Small Models in Reliable OpenACC-based Parallelization via GEPA-Optimized Prompting [0.0]
We present a systematic prompt optimization approach to enhance OpenACC pragma generation. We observe an increase in compilation success rates for programs annotated with OpenACC pragmas.
arXiv Detail & Related papers (2026-01-12T23:54:08Z)
- Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights [75.83625828306839]
Drag-and-Drop LLMs (DnD) eliminates per-task training by mapping a handful of unlabeled task prompts directly to LoRA weight updates. A lightweight text encoder distills each prompt batch into condition embeddings, which are then transformed by a cascaded hyper-convolutional decoder into the full set of LoRA matrices.
arXiv Detail & Related papers (2025-06-19T15:38:21Z)
- NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
- Can Large Language Models Predict Parallel Code Performance? [1.5221392705893568]
This paper explores whether Large Language Models (LLMs) can offer an alternative approach for GPU performance prediction without relying on hardware. LLMs have a strong understanding of the Roofline model, achieving 100% classification accuracy when provided with explicit profiling data. Our findings suggest that with better datasets and prompt strategies, LLMs could become practical tools for HPC roofline analysis and performance portability.
arXiv Detail & Related papers (2025-05-06T21:41:20Z)
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
arXiv Detail & Related papers (2025-04-05T02:52:16Z)
- LASSI: An LLM-based Automated Self-Correcting Pipeline for Translating Parallel Scientific Codes [0.23301643766310373]
LASSI is designed to translate between parallel programming languages by bootstrapping existing closed- or open-source LLMs. LASSI incorporates autonomous enhancement through self-correcting loops where errors encountered during the compilation and execution of generated code are fed back to the LLM.
arXiv Detail & Related papers (2024-06-30T19:36:04Z)
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) incurs significant hardware-performance overheads.
We propose a novel parallel prompt decoding method that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to 2.49× speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data to improve LLMs' instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
- Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? [60.50127555651554]
Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. We introduce a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs.
arXiv Detail & Related papers (2024-03-11T15:48:56Z)
- Advising OpenMP Parallelization via a Graph-Based Approach with Transformers [2.393682571484038]
We propose a novel approach, called OMPify, to detect and predict OpenMP pragmas and shared-memory attributes in parallel code.
OMPify is based on a Transformer-based model that leverages a graph-based representation of source code.
Our results demonstrate that OMPify outperforms existing approaches, including the general-purpose and popular ChatGPT and the targeted PragFormer model.
arXiv Detail & Related papers (2023-05-16T16:56:10Z)
- Learning to Parallelize in a Shared-Memory Environment with Transformers [3.340971990034025]
OpenMP is the most comprehensive API that implements shared memory parallelization schemes.
Many source-to-source (S2S) compilers have been created over the years, tasked with inserting OpenMP directives into code automatically.
In this work, we propose leveraging recent advances in ML techniques, specifically in natural language processing (NLP), to replace S2S compilers altogether.
arXiv Detail & Related papers (2022-04-27T10:39:52Z)
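The two OpenMP entries above describe the analogous shared-memory task. As a rough illustration, the sketch below shows the kind of directive-plus-attributes prediction target that OMPify-style models and S2S compilers produce; the scaling kernel and clause choices are our own assumptions, not an example from either paper.

```c
/*
 * Illustrative sketch only (not an example from OMPify or the S2S
 * papers): given the serial loop, predict whether an OpenMP pragma
 * applies and which data-sharing attributes it needs.
 */
void scale(int n, double alpha, double *v)
{
    /* Prediction target: the directive plus its shared-memory
     * attributes. With default(none), every variable used in the
     * loop body must be listed explicitly; the loop index i is
     * implicitly private. */
    #pragma omp parallel for default(none) shared(n, alpha, v)
    for (int i = 0; i < n; ++i)
        v[i] *= alpha;
}
```

Compiled with an OpenMP-enabled compiler (for example, gcc -fopenmp), this distributes the loop iterations across threads; without the flag, the pragma is ignored and the loop runs serially, which is what makes directive insertion a natural target for automated prediction.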
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.