Bridging the Gap: Empowering Small Models in Reliable OpenACC-based Parallelization via GEPA-Optimized Prompting
- URL: http://arxiv.org/abs/2601.08884v1
- Date: Mon, 12 Jan 2026 23:54:08 GMT
- Title: Bridging the Gap: Empowering Small Models in Reliable OpenACC-based Parallelization via GEPA-Optimized Prompting
- Authors: Samyak Jhaveri, Cristina V. Lopes
- Abstract summary: We present a systematic prompt optimization approach to enhance OpenACC pragma generation. We observe an increase in compilation success rates for programs annotated with OpenACC pragmas.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: OpenACC lowers the barrier to GPU offloading, but writing high-performing pragmas remains complex, requiring deep domain expertise in memory hierarchies, data movement, and parallelization strategies. Large Language Models (LLMs) present a promising potential solution for automated parallel code generation, but naive prompting often results in syntactically incorrect directives, uncompilable code, or performance that fails to exceed CPU baselines. We present a systematic prompt optimization approach to enhance OpenACC pragma generation without the prohibitive computational costs associated with model post-training. Leveraging the GEPA (GEnetic-PAreto) framework, we iteratively evolve prompts through a reflective feedback loop. This process utilizes crossover and mutation of instructions, guided by expert-curated gold examples and structured feedback based on clause- and clause-parameter-level mismatches between the gold and predicted pragmas. In our evaluation on the PolyBench suite, we observe an increase in compilation success rates for programs annotated with OpenACC pragmas generated using the optimized prompts compared to those annotated using the simpler initial prompt, particularly for the "nano"-scale models. Specifically, with optimized prompts, the compilation success rate for GPT-4.1 Nano surged from 66.7% to 93.3%, and for GPT-5 Nano improved from 86.7% to 100%, matching or surpassing the capabilities of their significantly larger, more expensive versions. Beyond compilation, the optimized prompts resulted in a 21% increase in the number of programs that achieve functional GPU speedups over CPU baselines. These results demonstrate that prompt optimization effectively unlocks the potential of smaller, cheaper LLMs in writing stable and effective GPU-offloading directives, establishing a cost-effective pathway to automated directive-based parallelization in HPC workflows.
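The abstract describes structured feedback derived from clause- and clause-parameter-level mismatches between gold and predicted pragmas. The paper's implementation is not reproduced here; the following is a minimal Python sketch of what such a comparison might look like, assuming simple single-line pragmas (the function names `parse_pragma` and `clause_feedback` are hypothetical, not from the paper):

```python
import re

def parse_pragma(pragma: str) -> dict:
    """Split an OpenACC pragma into {clause_name: parameter_string} pairs.
    e.g. '#pragma acc parallel loop collapse(2)' ->
         {'parallel': '', 'loop': '', 'collapse': '2'}"""
    body = pragma.replace("#pragma acc", "", 1).strip()
    # Each clause is a word, optionally followed by a parenthesized argument list.
    return {name: args for name, args in re.findall(r"(\w+)(?:\(([^)]*)\))?", body)}

def clause_feedback(gold: str, predicted: str) -> list:
    """Produce human-readable feedback strings for clause-level and
    clause-parameter-level mismatches between gold and predicted pragmas."""
    g, p = parse_pragma(gold), parse_pragma(predicted)
    feedback = []
    for name, args in g.items():
        if name not in p:
            feedback.append(f"missing clause: {name}")
        elif p[name] != args:
            feedback.append(f"clause '{name}' parameter mismatch: "
                            f"expected ({args}), got ({p[name]})")
    for name in p:
        if name not in g:
            feedback.append(f"extra clause: {name}")
    return feedback
```

Feedback strings of this kind could then be fed back into GEPA's reflective loop to steer prompt mutation; for example, comparing `#pragma acc parallel loop collapse(2) copyin(A[0:n])` against a prediction lacking `collapse(2)` yields a "missing clause" message.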
Related papers
- FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference [11.772150619675527]
Unified Sequence Parallelism (USP) has emerged as the state-of-the-art approach for distributed attention computation. Existing USP implementations suffer from excessive kernel launch overhead and suboptimal communication scheduling. We propose FastUSP, a framework that integrates compile-level optimization, communication-level optimization, and operator-level optimization.
arXiv Detail & Related papers (2026-02-11T15:19:57Z) - GFlowPO: Generative Flow Network as a Language Model Prompt Optimizer [51.31263673158136]
GFlowPO casts prompt search as a posterior inference problem over latent prompts regularized by a meta-prompted reference-LM prior. GFlowPO consistently outperforms recent discrete prompt optimization baselines.
arXiv Detail & Related papers (2026-02-03T10:30:03Z) - A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization [9.49293344824955]
This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop. The proposed method can be extended to deliver automated performance optimization for real production workloads.
arXiv Detail & Related papers (2026-01-19T03:40:12Z) - An LLVM-Based Optimization Pipeline for SPDZ [0.0]
We implement a proof-of-concept LLVM-based optimization pipeline for the SPDZ protocol. Our front end accepts a subset of C with lightweight privacy annotations and lowers it to LLVM IR. Our back end performs data-flow and control-flow analysis on the optimized IR to drive a non-blocking runtime scheduler.
arXiv Detail & Related papers (2025-12-11T20:53:35Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - dParallel: Learnable Parallel Decoding for dLLMs [77.24184219948337]
Diffusion large language models (dLLMs) offer parallel token prediction and lower inference latency. Existing open-source models still require nearly token-length decoding steps to ensure performance. We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling.
arXiv Detail & Related papers (2025-09-30T16:32:52Z) - GPU-Accelerated Loopy Belief Propagation for Program Analysis [3.516434517865342]
This paper presents a GPU-accelerated LBP algorithm for program analysis. We propose a unified representation for specifying arbitrary user-defined update strategies, along with a dependency analysis algorithm. Our approach achieves an average speedup of 2.14x over the state-of-the-art sequential approach and 5.56x over the state-of-the-art GPU-based approach.
arXiv Detail & Related papers (2025-09-26T13:30:30Z) - ACCeLLiuM: Supervised Fine-Tuning for Automated OpenACC Pragma Generation [0.0]
We introduce ACCeLLiuM, two open-weight Large Language Models specifically fine-tuned for generating expert OpenACC directives for data-parallel loops. The ACCeLLiuM SFT dataset contains 4,033 OpenACC pragma-loop pairs mined from public GitHub C/C++ code, with 3,223 pairs for training and 810 for testing.
arXiv Detail & Related papers (2025-09-20T20:41:32Z) - Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [65.18157595903124]
This work investigates iterative approximate evaluation for arbitrary prompts. It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework. MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z) - NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [23.633481089469836]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to 2.49x speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers [108.72225067368592]
We propose a novel perspective to investigate the design of large language models (LLMs)-based prompts. We identify two pivotal factors in model parameter learning: update direction and update method. We develop a capable Gradient-inspired Prompt-based GPO.
arXiv Detail & Related papers (2024-02-27T15:05:32Z) - ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is indeed effective and has normalized RMSE as low as 0.004 and at most 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.