Using Document Similarity Methods to create Parallel Datasets for Code Translation
- URL: http://arxiv.org/abs/2110.05423v1
- Date: Mon, 11 Oct 2021 17:07:58 GMT
- Title: Using Document Similarity Methods to create Parallel Datasets for Code Translation
- Authors: Mayank Agarwal, Kartik Talamadupula, Fernando Martinez, Stephanie Houde, Michael Muller, John Richards, Steven I Ross, Justin D. Weisz
- Abstract summary: Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
- Score: 60.36392618065203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Translating source code from one programming language to another is a
critical, time-consuming task in modernizing legacy applications and codebases.
Recent work in this space has drawn inspiration from the software naturalness
hypothesis by applying natural language processing techniques towards
automating the code translation task. However, due to the paucity of parallel
data in this domain, supervised techniques have only been applied to a limited
set of popular programming languages. To bypass this limitation, unsupervised
neural machine translation techniques have been proposed to learn code
translation using only monolingual corpora. In this work, we propose to use
document similarity methods to create noisy parallel datasets of code, thus
enabling supervised techniques to be applied for automated code translation
without having to rely on the availability or expensive curation of parallel
code datasets. We explore the noise tolerance of models trained on such
automatically-created datasets and show that these models perform comparably to
models trained on ground truth for reasonable levels of noise. Finally, we
exhibit the practical utility of the proposed method by creating parallel
datasets for languages beyond the ones explored in prior work, thus expanding
the set of programming languages for automated code translation.
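To make the core idea concrete, here is a minimal sketch, not the authors' actual pipeline: treat each source file as a document, score every cross-language pair with TF-IDF cosine similarity, and keep each file's best-scoring match above a threshold as a noisy parallel example. The toy corpora and the threshold value are illustrative assumptions.

```python
# Minimal sketch: mine noisy parallel code pairs by document similarity.
# The toy corpora and the 0.1 threshold are illustrative assumptions,
# not the paper's actual data or configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mine_noisy_pairs(src_docs, tgt_docs, threshold=0.1):
    """Greedily match each source document to its most similar target."""
    # \w+ keeps single-character identifiers; a real pipeline might use
    # a proper code tokenizer instead.
    vectorizer = TfidfVectorizer(token_pattern=r"\w+")
    matrix = vectorizer.fit_transform(src_docs + tgt_docs)
    src_vecs, tgt_vecs = matrix[: len(src_docs)], matrix[len(src_docs):]
    sims = cosine_similarity(src_vecs, tgt_vecs)  # (|src|, |tgt|) scores
    pairs = []
    for i, row in enumerate(sims):
        j = row.argmax()
        if row[j] >= threshold:  # below threshold: leave the file unmatched
            pairs.append((src_docs[i], tgt_docs[int(j)], float(row[j])))
    return pairs

python_files = ["def add(a, b):\n    return a + b",
                "def greet(name):\n    print('hello', name)"]
java_files = ["int add(int a, int b) { return a + b; }",
              "void greet(String name) { System.out.println(\"hello \" + name); }"]
print(mine_noisy_pairs(python_files, java_files))
```

The resulting pairs are deliberately noisy; the paper's experiments measure how much of this noise supervised translation models can tolerate before performance degrades.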
Related papers
- NoviCode: Generating Programs from Natural Language Utterances by Novices [59.71218039095155]
We present NoviCode, a novel NL Programming task which takes as input an API and a natural language description by a novice non-programmer.
We show that NoviCode is indeed a challenging task in the code synthesis domain, and that generating complex code from non-technical instructions goes beyond the current Text-to-Code paradigm.
arXiv Detail & Related papers (2024-07-15T11:26:03Z)
- AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned code in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z)
- Semantic Parsing in Limited Resource Conditions [19.689433249830465]
The thesis explores challenges in semantic parsing, specifically focusing on scenarios with limited data and computational resources.
It offers solutions using techniques like automatic data curation, knowledge transfer, active learning, and continual learning.
arXiv Detail & Related papers (2023-09-14T05:03:09Z)
- Neural Machine Translation for Code Generation [0.7607163273993514]
In NMT for code generation, the task is to generate source code that satisfies constraints expressed in the input.
In this paper we survey the NMT for code generation literature, cataloging the variety of methods that have been explored.
We discuss the limitations of existing methods and future research directions.
arXiv Detail & Related papers (2023-05-22T21:43:12Z)
- Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages [86.08359401867577]
Back-translation is widely known for its effectiveness in neural machine translation when little to no parallel data is available.
We propose performing back-translation via code summarization and generation (see the sketch below).
We show that our proposed approach performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2022-05-23T08:20:41Z)
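A rough sketch of the summarize-and-generate idea from the entry above: instead of back-translating code directly, a summarization model maps code to natural language, and a generation model maps that summary into the other language, yielding a synthetic parallel pair. The `summarize` and `generate` stubs are placeholder assumptions, not the paper's trained models.

```python
# Schematic of back-translation via code summarization and generation.
# `summarize` and `generate` are placeholder stubs standing in for
# trained neural models; they are assumptions for illustration only.
def summarize(code: str, lang: str) -> str:
    """Map source code to a natural-language summary (stub model)."""
    return f"a {lang} function: {code[:40]}"

def generate(summary: str, lang: str) -> str:
    """Map a natural-language summary to code in `lang` (stub model)."""
    return f"# {lang} code implementing: {summary}"

def synthesize_pair(java_code: str):
    """Build a synthetic (Python, Java) pair from monolingual Java code."""
    summary = summarize(java_code, "Java")      # code -> natural language
    python_code = generate(summary, "Python")   # natural language -> code
    return python_code, java_code               # supervise Python -> Java

src, tgt = synthesize_pair("int add(int a, int b) { return a + b; }")
print(src, "=>", tgt)
```

- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]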
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- Multilingual Transfer Learning for Code-Switched Language and Speech Neural Modeling [12.497781134446898]
We address the data scarcity and limitations of linguistic theory by proposing language-agnostic multi-task training methods.
First, we introduce a meta-learning-based approach, meta-transfer learning, in which information is judiciously extracted from high-resource monolingual speech data to the code-switching domain.
Second, we propose a novel multilingual meta-embeddings approach to effectively represent code-switching data by acquiring useful knowledge learned in other languages.
Third, we introduce multi-task learning to integrate syntactic information as a transfer learning strategy to a language model and learn where to code-switch.
arXiv Detail & Related papers (2021-04-13T14:49:26Z)
- Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs but fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models (see the sketch below).
arXiv Detail & Related papers (2021-01-20T17:54:47Z)
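A rough sketch of the embedding-based alignment idea from the entry above: embed both sentences with a multilingual LM and link each source token to its most similar target token by cosine similarity. The model choice and the greedy argmax matching are illustrative assumptions; the parallel-text fine-tuning the paper proposes is omitted.

```python
# Sketch: word alignment from a multilingual LM's contextual embeddings.
# Model choice and greedy argmax matching are illustrative assumptions;
# the parallel-text fine-tuning the paper proposes is omitted here.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(sentence):
    """Return sub-word tokens and their contextual vectors ([CLS]/[SEP] dropped)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens[1:-1], hidden[1:-1]

def align(src, tgt):
    """Greedily link each source token to its most similar target token."""
    src_toks, src_vecs = embed(src)
    tgt_toks, tgt_vecs = embed(tgt)
    sims = torch.nn.functional.cosine_similarity(
        src_vecs.unsqueeze(1), tgt_vecs.unsqueeze(0), dim=-1
    )  # (src_len, tgt_len) similarity matrix
    return [(tok, tgt_toks[int(sims[i].argmax())]) for i, tok in enumerate(src_toks)]

print(align("The cat sleeps", "Le chat dort"))
```

- DeepSumm -- Deep Code Summaries using Neural Transformer Architecture [8.566457170664927]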
We employ neural techniques to solve the task of source code summarization.
Using more than 2.1 million supervised comment-code samples, we reduce training time by more than 50%.
arXiv Detail & Related papers (2020-03-31T22:43:29Z)