Creating a Dataset for High-Performance Computing Code Translation using
LLMs: A Bridge Between OpenMP Fortran and C++
- URL: http://arxiv.org/abs/2307.07686v4
- Date: Mon, 18 Sep 2023 18:10:37 GMT
- Title: Creating a Dataset for High-Performance Computing Code Translation using
LLMs: A Bridge Between OpenMP Fortran and C++
- Authors: Bin Lei, Caiwen Ding, Le Chen, Pei-Hung Lin, Chunhua Liao
- Abstract summary: The effectiveness of our dataset is assessed using both quantitative (CodeBLEU) and qualitative (human evaluation) methods.
Models without prior coding knowledge experienced a boost of $mathbftimes5.1$ in CodeBLEU scores.
Models with some coding familiarity saw an impressive $mathbftimes9.9$-fold increase.
- Score: 7.872005563259838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we present a novel dataset for training machine learning
models translating between OpenMP Fortran and C++ code. To ensure reliability
and applicability, the dataset is created from a range of representative
open-source OpenMP benchmarks. It is also refined using a meticulous code
similarity test. The effectiveness of our dataset is assessed using both
quantitative (CodeBLEU) and qualitative (human evaluation) methods. We showcase
how this dataset significantly elevates the translation competencies of large
language models (LLMs). Specifically, models without prior coding knowledge
experienced a boost of $\mathbf{\times~5.1}$ in their CodeBLEU scores, while
models with some coding familiarity saw an impressive
$\mathbf{\times~9.9}$-fold increase. The best fine-tuned model using our
dataset outperforms GPT-4. It is also reaching human-level accuracy. This work
underscores the immense potential of our dataset in propelling advancements in
the domain of code translation for high-performance computing. The dataset is
accessible at
\href{https://github.com/bin123apple/Fortran-CPP-HPC-code-translation-dataset}{OpenMP-Fortran-CPP-Translation}.
Related papers
- CodeShell Technical Report [23.741490720927068]
We present CodeShell-Base, a foundation model with 8K context length, showcasing exceptional proficiency in code comprehension.
We have curated 100 billion high-quality pre-training data from GitHub.
Benefiting from the high-quality data, CodeShell-Base outperforms CodeLlama in Humaneval after training on just 500 billion tokens (5 epochs)
arXiv Detail & Related papers (2024-03-23T07:29:41Z) - Linear-time Minimum Bayes Risk Decoding with Reference Aggregation [52.1701152610258]
Minimum Bayes Risk (MBR) decoding is a text generation technique that has been shown to improve the quality of machine translations.
It requires the pairwise calculation of a utility metric, which has quadratic complexity.
We propose to approximate pairwise metric scores with scores calculated against aggregated reference representations.
arXiv Detail & Related papers (2024-02-06T18:59:30Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z) - Leveraging Generative AI: Improving Software Metadata Classification
with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration.
This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful"
We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
arXiv Detail & Related papers (2023-10-14T12:09:43Z) - Generating and Imputing Tabular Data via Diffusion and Flow-based
Gradient-Boosted Trees [11.732842929815401]
Tabular data is hard to acquire and is subject to missing values.
This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) data.
In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost.
arXiv Detail & Related papers (2023-09-18T17:49:09Z) - Exploring Continual Learning for Code Generation Models [80.78036093054855]
Continual Learning (CL) is an important aspect that remains underexplored in the code domain.
We introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement.
We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism.
arXiv Detail & Related papers (2023-07-05T16:58:39Z) - Advising OpenMP Parallelization via a Graph-Based Approach with
Transformers [2.393682571484038]
We propose a novel approach, called OMPify, to detect and predict the OpenMP pragmas and shared-memory attributes in parallel code.
OMPify is based on a Transformer-based model that leverages a graph-based representation of source code.
Our results demonstrate that OMPify outperforms existing approaches, the general-purposed and popular ChatGPT and targeted PragFormer models.
arXiv Detail & Related papers (2023-05-16T16:56:10Z) - Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - CoDesc: A Large Code-Description Parallel Dataset [4.828053113572208]
We present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions.
With extensive analysis, we identify and remove prevailing noise patterns from the dataset.
We show that the dataset helps improve code search by up to 22% and achieves the new state-of-the-art in code summarization.
arXiv Detail & Related papers (2021-05-29T05:40:08Z) - Code to Comment "Translation": Data, Metrics, Baselining & Evaluation [49.35567240750619]
We analyze several recent code-comment datasets for this task.
We compare them with WMT19, a standard dataset frequently used to train state of the art natural language translators.
We find some interesting differences between the code-comment data and the WMT19 natural language data.
arXiv Detail & Related papers (2020-10-03T18:57:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.