Enhancing Code Translation in Language Models with Few-Shot Learning via Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2407.19619v1
- Date: Mon, 29 Jul 2024 00:41:48 GMT
- Title: Enhancing Code Translation in Language Models with Few-Shot Learning via Retrieval-Augmented Generation
- Authors: Manish Bhattarai, Javier E. Santos, Shawn Jones, Ayan Biswas, Boian Alexandrov, Daniel O'Malley
- Abstract summary: This paper introduces a novel approach that enhances code translation through Few-Shot Learning.
By leveraging a repository of existing code translations, we dynamically retrieve the most relevant examples to guide the model in translating new code segments.
Our method, based on Retrieval-Augmented Generation, substantially improves translation quality.
- Score: 1.9726019592585404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of large language models (LLMs) has significantly advanced the field of code translation, enabling automated translation between programming languages. However, these models often struggle with complex translation tasks due to inadequate contextual understanding. This paper introduces a novel approach that enhances code translation through Few-Shot Learning, augmented with retrieval-based techniques. By leveraging a repository of existing code translations, we dynamically retrieve the most relevant examples to guide the model in translating new code segments. Our method, based on Retrieval-Augmented Generation (RAG), substantially improves translation quality by providing contextual examples from which the model can learn in real time. We selected RAG over traditional fine-tuning methods due to its ability to utilize existing codebases or a locally stored corpus of code, which allows for dynamic adaptation to diverse translation tasks without extensive retraining. Extensive experiments on diverse datasets with open LLM models such as Starcoder, Llama3-70B Instruct, CodeLlama-34B Instruct, Granite-34B Code Instruct, and Mixtral-8x22B, as well as commercial LLM models like GPT-3.5 Turbo and GPT-4o, demonstrate our approach's superiority over traditional zero-shot methods, especially in translating between Fortran and C++. We also explored varying numbers of shots (i.e., examples provided during inference), specifically 1, 2, and 3 shots, and different embedding models for RAG, including Nomic-Embed, Starencoder, and CodeBERT, to assess the robustness and effectiveness of our approach.
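The page contains no code, so the following is a minimal, hypothetical sketch (not the authors' implementation) of the retrieval-augmented few-shot prompting idea described above: embed the incoming Fortran snippet, retrieve the most similar stored Fortran/C++ translation pairs, and prepend them to the prompt as shots. The embedding function, data structures, and example snippets are placeholders; in practice the embeddings would come from a model such as Nomic-Embed, Starencoder, or CodeBERT, and the assembled prompt would be sent to one of the LLMs listed in the abstract.

```python
# Sketch of RAG-based few-shot prompting for Fortran -> C++ translation.
# Not the paper's code: the embedding below is a hashing placeholder standing
# in for a real code embedding model (e.g. Nomic-Embed, Starencoder, CodeBERT).
from __future__ import annotations

from dataclasses import dataclass

import numpy as np


@dataclass
class TranslationPair:
    fortran: str
    cpp: str


def embed(text: str, dim: int = 256) -> np.ndarray:
    """Placeholder embedding: character-trigram hashing, L2-normalized."""
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def retrieve(query: str, corpus: list[TranslationPair], k: int = 3) -> list[TranslationPair]:
    """Return the k stored pairs whose Fortran side is most similar to the query."""
    q = embed(query)
    scores = [float(q @ embed(p.fortran)) for p in corpus]
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]


def build_prompt(query: str, shots: list[TranslationPair]) -> str:
    """Assemble a few-shot prompt: retrieved examples followed by the new snippet."""
    parts = ["Translate the following Fortran code to C++.\n"]
    for p in shots:
        parts.append(f"Fortran:\n{p.fortran}\nC++:\n{p.cpp}\n")
    parts.append(f"Fortran:\n{query}\nC++:\n")
    return "\n".join(parts)


if __name__ == "__main__":
    corpus = [
        TranslationPair("do i = 1, n\n  s = s + a(i)\nend do",
                        "for (int i = 0; i < n; ++i) { s += a[i]; }"),
        TranslationPair("if (x > 0) then\n  y = 1\nend if",
                        "if (x > 0) { y = 1; }"),
    ]
    query = "do j = 1, m\n  b(j) = 2 * b(j)\nend do"
    prompt = build_prompt(query, retrieve(query, corpus, k=1))
    print(prompt)  # send this prompt to the chosen LLM via its own API
```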
Related papers
- A Case Study on Context-Aware Neural Machine Translation with Multi-Task Learning [49.62044186504516]
In document-level neural machine translation (DocNMT), multi-encoder approaches are common in encoding context and source sentences.
Recent studies have shown that the context encoder generates noise and makes the model robust to the choice of context.
This paper further investigates this observation by explicitly modelling context encoding through multi-task learning (MTL) to make the model sensitive to the choice of context.
arXiv Detail & Related papers (2024-07-03T12:50:49Z)
- Investigating the translation capabilities of Large Language Models trained on parallel data only [1.5974665548135587]
Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks.
We introduce PLUME, a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples.
These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones.
arXiv Detail & Related papers (2024-06-13T14:08:56Z)
- Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation [8.81447711370817]
We empirically analyze the generated outputs of eleven popular instruct-tuned large language models (LLMs).
Our results demonstrate that a strategic combination of prompt engineering and regular expression can effectively extract the source code from the model generation output.
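As a minimal illustration of that extraction step (a sketch under common conventions, not the cited paper's code): prompt the model to wrap its answer in a fenced code block, then use a regular expression to pull the block out of the response.

```python
# Hypothetical sketch: extract source code from an LLM reply by preferring a
# fenced code block and falling back to the raw text when none is found.
import re

# Three backticks, optional language tag, body, three backticks.
FENCE = re.compile(r"`{3}[a-zA-Z+]*\n(.*?)`{3}", re.DOTALL)


def extract_code(model_output: str) -> str:
    """Return the first fenced code block, or the whole output if none exists."""
    match = FENCE.search(model_output)
    return match.group(1).strip() if match else model_output.strip()


if __name__ == "__main__":
    fence = "`" * 3
    reply = (f"Here is the translation:\n{fence}cpp\n"
             f"int add(int a, int b) {{ return a + b; }}\n{fence}\nHope this helps.")
    print(extract_code(reply))  # -> int add(int a, int b) { return a + b; }
```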
arXiv Detail & Related papers (2024-03-25T21:41:31Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- Exploring Large Language Models for Code Explanation [3.2570216147409514]
Large Language Models (LLMs) have made remarkable strides in Natural Language Processing.
This study specifically delves into the task of generating natural-language summaries for code snippets, using various LLMs.
arXiv Detail & Related papers (2023-10-25T14:38:40Z)
- Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning [57.74233319453229]
Large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task.
We propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus.
Our experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results.
arXiv Detail & Related papers (2023-10-17T03:21:43Z)
- CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z)
- TIM: Teaching Large Language Models to Translate with Comparison [78.66926087162672]
We propose a novel framework using examples in comparison to teach LLMs to learn translation.
Our approach involves presenting the model with examples of correct and incorrect translations and using a preference loss to guide the model's learning.
Our findings offer a new perspective on fine-tuning LLMs for translation tasks and provide a promising solution for generating high-quality translations.
arXiv Detail & Related papers (2023-07-10T08:15:40Z)
- Extrapolating Multilingual Understanding Models as Multilingual Generators [82.1355802012414]
This paper explores methods to empower multilingual understanding models with generation abilities to obtain a unified model.
We propose a Semantic-Guided Alignment-then-Denoising (SGA) approach to adapt an encoder to a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z)
- Few-shot Knowledge Graph-to-Text Generation with Pretrained Language Models [42.38563175680914]
This paper studies how to automatically generate natural-language text that describes the facts in a knowledge graph (KG).
Considering the few-shot setting, we leverage the excellent capacities of pretrained language models (PLMs) in language understanding and generation.
arXiv Detail & Related papers (2021-06-03T06:48:00Z)