CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming
- URL: http://arxiv.org/abs/2410.20527v1
- Date: Sun, 27 Oct 2024 17:34:07 GMT
- Title: CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming
- Authors: Ali TehraniJamsaz, Arijit Bhattacharjee, Le Chen, Nesreen K. Ahmed, Amir Yazdanbakhsh, Ali Jannesari
- Abstract summary: We introduce CodeRosetta, an encoder-decoder transformer model designed specifically for translating between programming languages and their HPC extensions.
CodeRosetta is evaluated on C++ to CUDA and Fortran to C++ translation tasks.
Our results show that CodeRosetta outperforms state-of-the-art baselines in C++ to CUDA translation.
- Score: 15.391781573025787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have renewed interest in automatic programming language translation. Encoder-decoder transformer models, in particular, have shown promise in translating between different programming languages. However, translating between a language and its high-performance computing (HPC) extensions remains underexplored due to challenges such as complex parallel semantics. In this paper, we introduce CodeRosetta, an encoder-decoder transformer model designed specifically for translating between programming languages and their HPC extensions. CodeRosetta is evaluated on C++ to CUDA and Fortran to C++ translation tasks. It uses a customized learning framework with tailored pretraining and training objectives to effectively capture both code semantics and parallel structural nuances, enabling bidirectional translation. Our results show that CodeRosetta outperforms state-of-the-art baselines in C++ to CUDA translation by 2.9 BLEU and 1.72 CodeBLEU points while improving compilation accuracy by 6.05%. Compared to general closed-source LLMs, our method improves C++ to CUDA translation by 22.08 BLEU and 14.39 CodeBLEU, with 2.75% higher compilation accuracy. Finally, CodeRosetta exhibits proficiency in Fortran to parallel C++ translation, marking it, to our knowledge, as the first encoder-decoder model for this complex task, improving CodeBLEU by at least 4.63 points compared to closed-source and open-code LLMs.
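One of the reported metrics is compilation accuracy: the fraction of translated programs that actually compile. As a minimal sketch (not the paper's evaluation harness), compilation accuracy for generated CUDA sources could be measured as follows, assuming `nvcc` is on the PATH:

```python
import subprocess
import tempfile
from pathlib import Path

def compiles(cuda_source: str) -> bool:
    """Return True if nvcc compiles the CUDA source without errors."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.cu"
        src.write_text(cuda_source)
        result = subprocess.run(
            ["nvcc", "-c", str(src), "-o", str(Path(tmp) / "candidate.o")],
            capture_output=True,
        )
        return result.returncode == 0

def compilation_accuracy(translations: list[str]) -> float:
    """Fraction of translated CUDA sources that compile successfully."""
    return sum(compiles(t) for t in translations) / len(translations)
```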
Related papers
- LLM-Assisted Translation of Legacy FORTRAN Codes to C++: A Cross-Platform Study [38.73914653312889]
Large Language Models (LLMs) are increasingly being leveraged for generating and translating scientific computer codes.
Here, we studied the applicability of LLM-based translation of Fortran to C++ as a step towards building an agentic workflow.
We statistically quantified the compilation accuracy of the translated C++ code, measured the similarity of the LLM-translated code to human-translated C++ code, and quantified the output similarity of the Fortran-to-C++ translations.
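The summary does not pin down the exact similarity metric, so as an illustrative stand-in, a token-level similarity between an LLM translation and a human reference can be computed with only the standard library:

```python
import difflib
import re

def tokenize(code: str) -> list[str]:
    # Crude lexer: identifiers, numbers, and individual punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def similarity(llm_cpp: str, human_cpp: str) -> float:
    """Token-level similarity ratio between two C++ sources, in [0, 1]."""
    matcher = difflib.SequenceMatcher(None, tokenize(llm_cpp), tokenize(human_cpp))
    return matcher.ratio()

print(similarity("for (int i = 0; i < n; ++i) a[i] = b[i];",
                 "for (int j = 0; j < n; j++) a[j] = b[j];"))
```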
arXiv Detail & Related papers (2025-04-21T20:34:37Z)
- ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding [60.37988508851391]
Language models (LMs) have become a staple of the code-writing toolbox.
Research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling syntax from semantics, has been noticeably sparse.
In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency.
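ObscuraCoder's exact obfuscation pipeline is not spelled out here, but the core idea of grounding on obfuscated code can be illustrated by renaming identifiers to opaque placeholders so a model cannot lean on surface-form names (a deliberately crude regex sketch; it would also rewrite names inside string literals):

```python
import re

PY_KEYWORDS = {"def", "return", "for", "in", "if", "else", "while", "import",
               "from", "class", "range", "len", "print", "None", "True", "False"}

def obfuscate(source: str) -> str:
    """Rename each identifier to an opaque placeholder (VAR_0, VAR_1, ...)."""
    mapping: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name in PY_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"VAR_{len(mapping)}"
        return mapping[name]

    return re.sub(r"[A-Za-z_]\w*", rename, source)

print(obfuscate("def scale(values, factor):\n    return [v * factor for v in values]"))
```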
arXiv Detail & Related papers (2025-03-27T23:08:53Z)
- Fortran2CPP: Automating Fortran-to-C++ Translation using LLMs via Multi-Turn Dialogue and Dual-Agent Integration [10.985254527043429]
Our dataset comprises 11.7k dialogues capturing feedback-decision workflows, including code translation, compilation, execution, unit testing, and error-fixing.
Using this dataset, we achieve up to a 3.31x improvement in CodeBLEU scores and a 92% increase in compilation success rate.
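A skeletal version of such a compile-and-repair dialogue loop might look as follows, with `ask_llm` standing in for whatever model client is used (a hypothetical placeholder, not the paper's code):

```python
import subprocess
import tempfile
from pathlib import Path

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat-model call; plug in any LLM client."""
    raise NotImplementedError

def translate_with_feedback(fortran_src: str, max_turns: int = 4) -> str:
    """Translate Fortran to C++, feeding compiler errors back until it compiles."""
    cpp = ask_llm(f"Translate this Fortran program to C++:\n{fortran_src}")
    for _ in range(max_turns):
        with tempfile.TemporaryDirectory() as tmp:
            src = Path(tmp) / "candidate.cpp"
            src.write_text(cpp)
            result = subprocess.run(
                ["g++", "-c", str(src), "-o", str(Path(tmp) / "candidate.o")],
                capture_output=True, text=True,
            )
        if result.returncode == 0:
            return cpp  # candidate compiles; accept it
        # Next dialogue turn: feed the compiler diagnostics back to the model.
        cpp = ask_llm(f"This C++ fails to compile:\n{cpp}\n"
                      f"Compiler errors:\n{result.stderr}\nReturn a fixed version.")
    return cpp
```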
arXiv Detail & Related papers (2024-12-27T18:06:25Z)
- Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation [1.64043572114825]
We introduce a novel method to enhance cross-language code translation from Fortran to C++ by integrating task-specific embedding alignment.
Our strategy aligns the retrieval model directly with the objective of maximizing translation quality, as quantified by the CodeBLEU metric.
By integrating these CodeBLEU-optimized embeddings into the RAG framework, our approach significantly enhances both retrieval accuracy and code generation quality.
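At inference time, the aligned embeddings are used to fetch the most relevant exemplars for the RAG prompt. A minimal cosine-similarity retrieval sketch, with random vectors standing in for a real, CodeBLEU-aligned encoder:

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 3):
    """Indices of the k corpus embeddings most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

# Toy stand-in: random "embeddings"; a real pipeline would embed the Fortran
# query and the (Fortran, C++) exemplar pool with the aligned retrieval model.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 64))
query = rng.normal(size=64)
print(cosine_top_k(query, corpus))
```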
arXiv Detail & Related papers (2024-12-06T16:22:32Z)
- Leveraging Large Language Models for Code Translation and Software Development in Scientific Computing [0.9668407688201359]
Generative artificial intelligence (GenAI) is poised to transform productivity in scientific computing.
We developed a tool, CodeScribe, which combines prompt engineering with user supervision to establish an efficient process for code conversion.
We also address the challenges of AI-driven code translation and highlight its benefits for enhancing productivity in scientific computing.
arXiv Detail & Related papers (2024-10-31T16:48:41Z)
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
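Pass@1 figures like this are typically computed with the unbiased pass@k estimator introduced with HumanEval: draw n samples per task, count the c correct ones, and estimate the probability that at least one of k drawn samples passes. A small sketch, assuming CRUXEVAL-X follows this standard definition:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for n generated samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per task, 3 of which pass the tests:
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```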
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset of nearly 4M self-contained source code files paired with their compiler intermediate representations.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
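The IR side of such a dataset can be produced directly with clang. A small sketch of lowering a C source string to textual LLVM IR, assuming clang is installed (the paper's own pipeline and flags may differ):

```python
import subprocess

def c_to_llvm_ir(c_source: str) -> str:
    """Compile a C source string (read from stdin) to textual LLVM IR."""
    result = subprocess.run(
        ["clang", "-x", "c", "-", "-S", "-emit-llvm", "-o", "-"],
        input=c_source, capture_output=True, text=True, check=True,
    )
    return result.stdout

print(c_to_llvm_ir("int add(int a, int b) { return a + b; }"))
```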
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- Converting Epics/Stories into Pseudocode using Transformers [0.0]
Pseudocode is an informal, plain-language description of the steps involved in a computer program.
We present a methodology to convert a problem described in the English language into pseudocode.
We find that the CodeT5 model gives the best results in terms of BLEU score when trained separately on the two subtasks.
arXiv Detail & Related papers (2023-12-08T14:01:09Z)
- Data Augmentation for Code Translation with Comparable Corpora and Multiple References [21.754147577489764]
We build and analyze multiple types of comparable corpora, including programs generated from natural language documentation.
To reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data.
Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% in Computational Accuracy.
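Computational Accuracy counts a translation as correct only if it behaves like the reference. For a Python translation target, a minimal sketch (the benchmark's real harness and timeouts may differ):

```python
import subprocess

def run_program(source: str, stdin: str = "") -> str:
    """Execute a Python program and capture its stdout."""
    result = subprocess.run(["python3", "-c", source], input=stdin,
                            capture_output=True, text=True, timeout=10)
    return result.stdout

def computationally_accurate(candidate: str, reference: str,
                             test_inputs: list[str]) -> bool:
    """Correct iff the candidate matches the reference on every test input."""
    return all(run_program(candidate, t) == run_program(reference, t)
               for t in test_inputs)
```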
arXiv Detail & Related papers (2023-11-01T06:01:22Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence.
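The confidence side of this idea can be sketched from decoder logits alone: compute each emitted token's probability and flag the low-confidence spans that would be handed to the solver (illustrative only; the paper's actual feature extraction is richer):

```python
import numpy as np

def low_confidence_tokens(logits: np.ndarray, tokens: list[str],
                          threshold: float = 0.5) -> list[tuple[int, str]]:
    """Flag tokens whose (greedy) model probability falls below a threshold.

    logits: array of shape (seq_len, vocab_size) from the decoder.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    confidence = probs.max(axis=1)  # probability of each greedily chosen token
    return [(i, t) for i, (t, p) in enumerate(zip(tokens, confidence)) if p < threshold]
```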
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- Exploring Continual Learning for Code Generation Models [80.78036093054855]
Continual Learning (CL) is an important aspect that remains underexplored in the code domain.
We introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement.
We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism.
arXiv Detail & Related papers (2023-07-05T16:58:39Z)
- Knowledge Transfer for Pseudo-code Generation from Low Resource Programming Language [13.716669765394293]
We focus on transferring the knowledge acquired by a code-to-pseudocode neural model trained on a high-resource PL (C++) using parallel code-pseudocode data.
We observe an improvement of 23.27% in the success rate of the generated C code through back-translation.
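Back-translation here means using the code-to-pseudocode direction to mint synthetic parallel pairs for the low-resource direction. A minimal sketch of the pairing step, with `code_to_pseudo` as a stand-in for any trained model:

```python
def back_translate(unlabeled_code: list[str], code_to_pseudo) -> list[tuple[str, str]]:
    """Build synthetic (pseudocode, code) pairs from unlabeled programs.

    code_to_pseudo: any function mapping source code to pseudocode; the
    resulting pairs can then train the reverse pseudocode-to-code model.
    """
    return [(code_to_pseudo(code), code) for code in unlabeled_code]
```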
arXiv Detail & Related papers (2023-03-16T03:38:08Z)
- Code Translation with Compiler Representations [21.702473137941006]
Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code.
Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation.
Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages.
arXiv Detail & Related papers (2022-06-30T14:21:57Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of semantically similar code.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)