Data Augmentation for Code Translation with Comparable Corpora and Multiple References
- URL: http://arxiv.org/abs/2311.00317v2
- Date: Fri, 04 Oct 2024 04:16:21 GMT
- Title: Data Augmentation for Code Translation with Comparable Corpora and Multiple References
- Authors: Yiqing Xie, Atharva Naik, Daniel Fried, Carolyn Rose
- Abstract summary: We build and analyze multiple types of comparable corpora, including programs generated from natural language documentation.
To reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data.
Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy.
- Abstract: One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code pairs with similar functionality), and another that augments existing parallel data with multiple reference translations. Specifically, we build and analyze multiple types of comparable corpora, including programs generated from natural language documentation using a code generation model. Furthermore, to reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data and filter the translations by unit tests, which increases variation in target translations. Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy (CA@1), which verifies the correctness of translations by execution. The code is available at https://github.com/Veronicium/CMTrans.
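To make the multi-reference augmentation concrete, the sketch below shows one way to sample extra candidate translations for a parallel example and keep only those that pass the target program's unit tests, so each source program ends up with several verified references. The `sample_translations` and `passes_unit_tests` callables are hypothetical placeholders (e.g., a sampling wrapper around a translation model and a test runner), not the authors' actual interfaces; their implementation is in the CMTrans repository linked above.

```python
from typing import Callable, Iterable, List

def augment_with_references(
    source_program: str,
    gold_target: str,
    sample_translations: Callable[[str, int], Iterable[str]],  # hypothetical: sampled model translations
    passes_unit_tests: Callable[[str], bool],                   # hypothetical: runs the target test suite
    num_samples: int = 10,
) -> List[str]:
    """Return the gold reference plus sampled translations that pass the
    unit tests, de-duplicated; the survivors serve as extra training
    references for the same source program."""
    references = [gold_target]
    for candidate in sample_translations(source_program, num_samples):
        if candidate not in references and passes_unit_tests(candidate):
            references.append(candidate)
    return references
```

Training then treats each (source, reference) pair as a separate example, which increases variation in target translations without requiring new human-written parallel data.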
Related papers
- CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming [15.391781573025787]
We introduce CodeRosetta, an encoder-decoder model designed specifically for translating between programming languages and their HPC extensions.
CodeRosetta is evaluated on C++ to parallel C++ translation tasks.
Our results show that CodeRosetta outperforms state-of-the-art baselines on these translation tasks.
arXiv Detail & Related papers (2024-10-27T17:34:07Z) - AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect code clones in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z) - Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine
Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z) - Program Translation via Code Distillation [20.668229308907495]
Traditional machine translation relies on parallel corpora for supervised translation.
Recent unsupervised neural machine translation techniques have overcome data limitations.
We propose a novel model called Code Distillation (CoDist).
arXiv Detail & Related papers (2023-10-17T04:59:15Z) - CodeTransOcean: A Comprehensive Multilingual Benchmark for Code
Translation [8.979765541978292]
CodeTransOcean is a large-scale comprehensive dataset that supports the largest variety of programming languages for code translation.
CodeTransOcean consists of three novel multilingual datasets: MultilingualTrans, supporting translation between multiple popular programming languages; NicheTrans, for translating between niche programming languages and popular ones; and LLMTrans, for evaluating the executability of code translated by large language models (LLMs).
arXiv Detail & Related papers (2023-10-08T00:16:18Z) - The Effect of Alignment Objectives on Code-Switching Translation [0.0]
We propose a way of training a single machine translation model that can translate monolingual sentences from one language to another.
This model can be considered a bilingual model in the human sense.
arXiv Detail & Related papers (2023-09-10T14:46:31Z) - Original or Translated? On the Use of Parallel Data for Translation
Quality Estimation [81.27850245734015]
We demonstrate a significant gap between parallel data and real QE data.
Parallel data is collected indiscriminately, and translationese may occur on either the source or the target side.
We find that using the source-original part of parallel corpus consistently outperforms its target-original counterpart.
arXiv Detail & Related papers (2022-12-20T14:06:45Z) - Using Document Similarity Methods to create Parallel Datasets for Code
Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
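As an illustration of the idea (not the paper's exact method), the sketch below pairs programs across two languages by TF-IDF similarity over identifier-like tokens; the actual similarity measures, filtering, and thresholds used in the paper may differ, and the 0.5 threshold here is an assumption.

```python
# Dependencies: scikit-learn (TfidfVectorizer, cosine_similarity).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mine_noisy_pairs(java_files, python_files, threshold=0.5):
    """Pair each Java file with its most similar Python file above a threshold.

    Returns (java_index, python_index, similarity) triples; the resulting
    dataset is noisy by construction, as in the paper's setting.
    """
    vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w+")  # identifier-like tokens
    tfidf = vectorizer.fit_transform(java_files + python_files)
    sims = cosine_similarity(tfidf[:len(java_files)], tfidf[len(java_files):])
    pairs = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= threshold:
            pairs.append((i, j, float(row[j])))
    return pairs
```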
arXiv Detail & Related papers (2021-10-11T17:07:58Z) - Meta Back-translation [111.87397401837286]
We propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model.
Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set.
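A toy, first-order sketch of that bi-level loop is shown below: the back-translation model is rewarded when the pseudo-parallel batch it generates makes the forward model do better on held-out data. The paper differentiates through the forward model's update; the REINFORCE-style surrogate and the tiny per-token models here are simplifications for illustration only.

```python
import torch
import torch.nn as nn

VOCAB, DIM, LEN, BATCH = 32, 16, 5, 8  # deliberately tiny toy sizes

class ToySeq2Seq(nn.Module):
    """A tiny per-token 'translator' standing in for a real seq2seq model."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.out = nn.Linear(DIM, VOCAB)

    def logits(self, src):                 # (B, L) -> (B, L, V)
        return self.out(self.emb(src))

    def loss(self, src, tgt):
        return nn.functional.cross_entropy(
            self.logits(src).reshape(-1, VOCAB), tgt.reshape(-1))

    def sample(self, src):                 # one sampled output token per position
        return torch.distributions.Categorical(logits=self.logits(src)).sample()

    def log_prob(self, src, out):          # (B,) sequence log-probabilities
        return torch.distributions.Categorical(
            logits=self.logits(src)).log_prob(out).sum(-1)

fwd, bt = ToySeq2Seq(), ToySeq2Seq()       # forward and back-translation models
fwd_opt = torch.optim.SGD(fwd.parameters(), lr=0.1)
bt_opt = torch.optim.SGD(bt.parameters(), lr=0.1)

mono_tgt = torch.randint(0, VOCAB, (BATCH, LEN))  # monolingual target-side batch
val_src = torch.randint(0, VOCAB, (BATCH, LEN))   # small parallel validation set
val_tgt = torch.randint(0, VOCAB, (BATCH, LEN))

for step in range(3):
    # 1) Back-translate target sentences into pseudo source sentences.
    with torch.no_grad():
        pseudo_src = bt.sample(mono_tgt)
    # 2) Train the forward model one step on the pseudo-parallel pair.
    fwd_loss = fwd.loss(pseudo_src, mono_tgt)
    fwd_opt.zero_grad(); fwd_loss.backward(); fwd_opt.step()
    # 3) Score the updated forward model on held-out parallel data.
    with torch.no_grad():
        val_loss = fwd.loss(val_src, val_tgt)
    # 4) Reward back-translations that led to a lower validation loss
    #    (a REINFORCE-style surrogate for the paper's meta-gradient).
    reward = -val_loss
    bt_loss = -(reward * bt.log_prob(mono_tgt, pseudo_src)).mean()
    bt_opt.zero_grad(); bt_loss.backward(); bt_opt.step()
```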
arXiv Detail & Related papers (2021-02-15T20:58:32Z) - Code to Comment "Translation": Data, Metrics, Baselining & Evaluation [49.35567240750619]
We analyze several recent code-comment datasets for this task.
We compare them with WMT19, a standard dataset frequently used to train state-of-the-art natural language translators.
We find some interesting differences between the code-comment data and the WMT19 natural language data.
arXiv Detail & Related papers (2020-10-03T18:57:26Z) - Unsupervised Translation of Programming Languages [19.56070393390029]
A transcompiler, also known as a source-to-source translator, is a system that converts source code from one high-level programming language to another.
We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy.
arXiv Detail & Related papers (2020-06-05T15:28:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.