Creating a Dataset for High-Performance Computing Code Translation using
LLMs: A Bridge Between OpenMP Fortran and C++
- URL: http://arxiv.org/abs/2307.07686v4
- Date: Mon, 18 Sep 2023 18:10:37 GMT
- Title: Creating a Dataset for High-Performance Computing Code Translation using
LLMs: A Bridge Between OpenMP Fortran and C++
- Authors: Bin Lei, Caiwen Ding, Le Chen, Pei-Hung Lin, Chunhua Liao
- Abstract summary: The effectiveness of our dataset is assessed using both quantitative (CodeBLEU) and qualitative (human evaluation) methods.
Models without prior coding knowledge experienced a boost of $mathbftimes5.1$ in CodeBLEU scores.
Models with some coding familiarity saw an impressive $mathbftimes9.9$-fold increase.
- Score: 7.872005563259838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we present a novel dataset for training machine learning
models translating between OpenMP Fortran and C++ code. To ensure reliability
and applicability, the dataset is created from a range of representative
open-source OpenMP benchmarks. It is also refined using a meticulous code
similarity test. The effectiveness of our dataset is assessed using both
quantitative (CodeBLEU) and qualitative (human evaluation) methods. We showcase
how this dataset significantly elevates the translation competencies of large
language models (LLMs). Specifically, models without prior coding knowledge
experienced a boost of $\mathbf{\times~5.1}$ in their CodeBLEU scores, while
models with some coding familiarity saw an impressive
$\mathbf{\times~9.9}$-fold increase. The best fine-tuned model using our
dataset outperforms GPT-4. It is also reaching human-level accuracy. This work
underscores the immense potential of our dataset in propelling advancements in
the domain of code translation for high-performance computing. The dataset is
accessible at
\href{https://github.com/bin123apple/Fortran-CPP-HPC-code-translation-dataset}{OpenMP-Fortran-CPP-Translation}.
Related papers
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.
We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.
Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z) - Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes [135.68092471784516]
We propose a simple and lightweight approach for fusing large language models and gradient-boosted decision trees.
We name our fusion methods LLM-Boost and PFN-Boost, respectively.
We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms.
arXiv Detail & Related papers (2025-02-04T19:30:41Z) - Fortran2CPP: Automating Fortran-to-C++ Translation using LLMs via Multi-Turn Dialogue and Dual-Agent Integration [10.985254527043429]
Our dataset comprises 11.7k dialogues capturing feedback-decision including code translation, compilation, execution, unit testing, and error-fixing.
Using this dataset, we achieve up to a 3.31x improvement in CodeBLEU scores and a 92% increase in compilation success rate.
arXiv Detail & Related papers (2024-12-27T18:06:25Z) - Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation [1.64043572114825]
We introduce a novel method to enhance cross-language code translation from Fortran to C++ by integrating task-specific embedding alignment.
Our strategy aligns the retrieval model directly with the objective of maximizing translation quality, as quantified by the CodeBLEU metric.
By integrating these CodeBLEU-optimized embeddings into the RAG framework, our approach significantly enhances both retrieval accuracy and code generation quality.
arXiv Detail & Related papers (2024-12-06T16:22:32Z) - CPP-UT-Bench: Can LLMs Write Complex Unit Tests in C++? [0.4915744683251149]
CPP-UT-Bench is a benchmark dataset to measure C++ unit test generation capability of a large language model (LLM)
The dataset includes 2,653 code, unit test pairs drawn from 14 different opensource C++s.
arXiv Detail & Related papers (2024-12-03T18:35:24Z) - Linear-time Minimum Bayes Risk Decoding with Reference Aggregation [52.1701152610258]
Minimum Bayes Risk (MBR) decoding is a text generation technique that has been shown to improve the quality of machine translations.
It requires the pairwise calculation of a utility metric, which has quadratic complexity.
We propose to approximate pairwise metric scores with scores calculated against aggregated reference representations.
arXiv Detail & Related papers (2024-02-06T18:59:30Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z) - Generating and Imputing Tabular Data via Diffusion and Flow-based
Gradient-Boosted Trees [11.732842929815401]
Tabular data is hard to acquire and is subject to missing values.
This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) data.
In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost.
arXiv Detail & Related papers (2023-09-18T17:49:09Z) - Advising OpenMP Parallelization via a Graph-Based Approach with
Transformers [2.393682571484038]
We propose a novel approach, called OMPify, to detect and predict the OpenMP pragmas and shared-memory attributes in parallel code.
OMPify is based on a Transformer-based model that leverages a graph-based representation of source code.
Our results demonstrate that OMPify outperforms existing approaches, the general-purposed and popular ChatGPT and targeted PragFormer models.
arXiv Detail & Related papers (2023-05-16T16:56:10Z) - Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z) - Code to Comment "Translation": Data, Metrics, Baselining & Evaluation [49.35567240750619]
We analyze several recent code-comment datasets for this task.
We compare them with WMT19, a standard dataset frequently used to train state of the art natural language translators.
We find some interesting differences between the code-comment data and the WMT19 natural language data.
arXiv Detail & Related papers (2020-10-03T18:57:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.