Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2412.05159v1
- Date: Fri, 06 Dec 2024 16:22:32 GMT
- Title: Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation
- Authors: Manish Bhattarai, Minh Vu, Javier E. Santos, Ismael Boureima, Daniel O'Malley
- Abstract summary: We introduce a novel method to enhance cross-language code translation from Fortran to C++ by integrating task-specific embedding alignment.
Our strategy aligns the retrieval model directly with the objective of maximizing translation quality, as quantified by the CodeBLEU metric.
By integrating these CodeBLEU-optimized embeddings into the RAG framework, our approach significantly enhances both retrieval accuracy and code generation quality.
- Score: 1.64043572114825
- Abstract: We introduce a novel method to enhance cross-language code translation from Fortran to C++ by integrating task-specific embedding alignment into a Retrieval-Augmented Generation (RAG) framework. Unlike conventional retrieval approaches that utilize generic embeddings agnostic to the downstream task, our strategy aligns the retrieval model directly with the objective of maximizing translation quality, as quantified by the CodeBLEU metric. This alignment ensures that the embeddings are semantically and syntactically meaningful for the specific code translation task. Our methodology involves constructing a dataset of 25,000 Fortran code snippets sourced from Stack-V2 dataset and generating their corresponding C++ translations using the LLaMA 3.1-8B language model. We compute pairwise CodeBLEU scores between the generated translations and ground truth examples to capture fine-grained similarities. These scores serve as supervision signals in a contrastive learning framework, where we optimize the embedding model to retrieve Fortran-C++ pairs that are most beneficial for improving the language model's translation performance. By integrating these CodeBLEU-optimized embeddings into the RAG framework, our approach significantly enhances both retrieval accuracy and code generation quality over methods employing generic embeddings. On the HPC Fortran2C++ dataset, our method elevates the average CodeBLEU score from 0.64 to 0.73, achieving a 14% relative improvement. On the Numerical Recipes dataset, we observe an increase from 0.52 to 0.60, marking a 15% relative improvement. Importantly, these gains are realized without any fine-tuning of the language model, underscoring the efficiency and practicality of our approach.
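The alignment step lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch illustration, not the authors' released code: it assumes a generic code encoder (microsoft/codebert-base here) and precomputed pairwise CodeBLEU scores, and trains embedding similarities to match CodeBLEU similarities via a soft contrastive (KL) objective.
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Hypothetical sketch of CodeBLEU-supervised embedding alignment.
# Assumes pairwise CodeBLEU scores between generated C++ translations
# were precomputed offline.
tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")  # assumed encoder
enc = AutoModel.from_pretrained("microsoft/codebert-base")
opt = torch.optim.AdamW(enc.parameters(), lr=2e-5)

def embed(snippets):
    batch = tok(snippets, padding=True, truncation=True, return_tensors="pt")
    out = enc(**batch).last_hidden_state[:, 0]     # [CLS]-token pooling
    return F.normalize(out, dim=-1)

def alignment_loss(snippets, codebleu, tau=0.07):
    """Push embedding similarities toward the (softmax-normalized)
    pairwise CodeBLEU scores of the snippets' generated translations."""
    z = embed(snippets)                            # (B, d) unit vectors
    sims = z @ z.T / tau                           # temperature-scaled cosine sims
    target = F.softmax(codebleu / tau, dim=-1)     # soft targets from CodeBLEU
    return F.kl_div(F.log_softmax(sims, dim=-1), target, reduction="batchmean")

# one illustrative step on a toy batch
snippets = ["program a\nend program a", "program b\nend program b"]
scores = torch.tensor([[1.0, 0.4], [0.4, 1.0]])    # toy pairwise CodeBLEU matrix
loss = alignment_loss(snippets, scores)
loss.backward()
opt.step()
```
In a full RAG pipeline, the tuned encoder would then index the Fortran-C++ pairs so that retrieval surfaces the examples most useful as few-shot context for the frozen translation model.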
Related papers
- Fortran2CPP: Automating Fortran-to-C++ Translation using LLMs via Multi-Turn Dialogue and Dual-Agent Integration [10.985254527043429]
Our dataset comprises 11.7k dialogues capturing feedback-decision workflows, including code translation, compilation, execution, unit testing, and error-fixing; a hedged sketch of such a loop follows below.
Using this dataset, we achieve up to a 3.31x improvement in CodeBLEU scores and a 92% increase in compilation success rate.
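As a rough illustration of this kind of compile-and-fix loop (a sketch only; `translate` and `repair` are hypothetical stand-ins for LLM calls, not this paper's API):
```python
import os
import subprocess
import tempfile

def compiles(cpp_source: str):
    """Syntax-check a C++ translation with g++; return (ok, compiler stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".cpp", delete=False) as f:
        f.write(cpp_source)
        path = f.name
    proc = subprocess.run(["g++", "-fsyntax-only", path],
                          capture_output=True, text=True)
    os.unlink(path)
    return proc.returncode == 0, proc.stderr

def translate_with_feedback(fortran_code, translate, repair, max_turns=3):
    """Multi-turn loop: translate, compile, and feed errors back.
    `translate` and `repair` stand in for LLM calls (hypothetical)."""
    cpp = translate(fortran_code)
    for _ in range(max_turns):
        ok, errors = compiles(cpp)
        if ok:
            break
        cpp = repair(cpp, errors)   # ask the model to fix compiler errors
    return cpp
```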
arXiv Detail & Related papers (2024-12-27T18:06:25Z)
- CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming [15.391781573025787]
We introduce CodeRosetta, an encoder-decoder model designed specifically for translating between programming languages and their HPC extensions.
CodeRosetta is evaluated on C++ to parallel C++ translation tasks.
Our results show that CodeRosetta outperforms state-of-the-art baselines on these translation tasks.
arXiv Detail & Related papers (2024-10-27T17:34:07Z)
- Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation [8.81447711370817]
We empirically analyze the generated outputs of eleven popular instruction-tuned large language models (LLMs).
Our results demonstrate that a strategic combination of prompt engineering and regular expressions can effectively extract the source code from the model's generation output.
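For instance, one plausible pattern for recovering fenced code from raw model output (a sketch, not the paper's exact expression) might look like:
```python
import re

# Recover code from fenced blocks; fall back to the whole text.
FENCE = re.compile(r"`{3}[\w+#-]*\n(.*?)`{3}", re.DOTALL)

def extract_code(llm_output: str) -> str:
    blocks = FENCE.findall(llm_output)
    return "\n\n".join(blocks).strip() if blocks else llm_output.strip()

fence = "`" * 3
sample = f"Here is the translation:\n{fence}cpp\nint main() {{ return 0; }}\n{fence}"
print(extract_code(sample))  # -> int main() { return 0; }
```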
arXiv Detail & Related papers (2024-03-25T21:41:31Z)
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
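For context, a generic in-batch InfoNCE objective of the kind commonly used to train such retrievers (a sketch under that assumption, not CCPR's exact loss):
```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, passage_emb, temperature=0.05):
    """Generic in-batch contrastive loss: the i-th query's positive is the
    i-th passage; every other passage in the batch acts as a negative."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature          # (B, B) cosine similarities
    labels = torch.arange(q.size(0))        # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))
```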
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
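A hedged sketch of that triplet-to-text step (the prompt wording and the `llm` callable are hypothetical):
```python
def contextualize(triplet, llm):
    """Expand a compact KG triplet into a context-rich segment.
    `llm` is a stand-in for any text-completion callable (hypothetical)."""
    head, relation, tail = triplet
    prompt = (
        f"Write a short, factual paragraph describing the relation "
        f"'{relation}' between '{head}' and '{tail}'."
    )
    return llm(prompt)

# e.g. contextualize(("Marie Curie", "educated_at", "University of Paris"), llm)
```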
arXiv Detail & Related papers (2024-01-28T08:56:49Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable improves the system's code generation performance.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
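One way to picture such a pipeline, as a sketch only (the prompt and `llm` callable are hypothetical; the parse check guards against the rewrite breaking the program):
```python
import ast

def clean_program(source: str, llm) -> str:
    """LLM-assisted cleaning: request a more structured, readable rewrite
    and keep the original whenever the rewrite no longer parses."""
    prompt = (
        "Rewrite the following Python program with descriptive names, "
        "helper functions, and comments, preserving its behavior:\n\n" + source
    )
    cleaned = llm(prompt)
    try:
        ast.parse(cleaned)      # cheap sanity check on the rewrite
    except SyntaxError:
        return source           # fall back to the original program
    return cleaned
```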
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++ [7.872005563259838]
The effectiveness of our dataset is assessed using both quantitative (CodeBLEU) and qualitative (human evaluation) methods.
Models without prior coding knowledge experienced a 5.1× boost in CodeBLEU scores.
Models with some coding familiarity saw an impressive 9.9× increase.
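For reference, CodeBLEU itself can be computed with the open-source `codebleu` package, assuming its `calc_codebleu` interface (versions may differ):
```python
from codebleu import calc_codebleu  # pip install codebleu (assumed interface)

result = calc_codebleu(
    references=[["int add(int a, int b) { return a + b; }"]],
    predictions=["int add(int x, int y) { return x + y; }"],
    lang="cpp",
)
print(result["codebleu"])  # aggregate score in [0, 1]
```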
arXiv Detail & Related papers (2023-07-15T02:35:51Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- Improving Generative Adversarial Networks with Local Coordinate Coding [150.24880482480455]
Generative adversarial networks (GANs) have shown remarkable success in generating realistic data from some predefined prior distribution.
In practice, semantic information might be represented by some latent distribution learned from data.
We propose an LCCGAN model with local coordinate coding (LCC) to improve data generation performance.
arXiv Detail & Related papers (2020-07-28T09:17:50Z)
- Retrieval-Augmented Generation for Code Summarization via Hybrid GNN [23.445231228940738]
We propose a novel retrieval-augmented mechanism to combine the benefits of both worlds.
To mitigate the limitation of Graph Neural Networks (GNNs) on capturing global graph structure information of source code, we propose a novel attention-based dynamic graph.
Our method achieves state-of-the-art performance, improving over existing methods by 1.42, 2.44, and 1.29 points in BLEU-4, ROUGE-L, and METEOR, respectively.
arXiv Detail & Related papers (2020-06-09T17:09:29Z)
- Pairwise Similarity Knowledge Transfer for Weakly Supervised Object Localization [53.99850033746663]
We study the problem of learning a localization model on target classes with weakly supervised image labels.
In this work, we argue that learning only an objectness function is a weak form of knowledge transfer.
Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of the pairwise similarity function.
arXiv Detail & Related papers (2020-03-18T17:53:33Z)