Knowledge Transfer for Pseudo-code Generation from Low Resource
Programming Language
- URL: http://arxiv.org/abs/2303.09062v1
- Date: Thu, 16 Mar 2023 03:38:08 GMT
- Title: Knowledge Transfer for Pseudo-code Generation from Low Resource
Programming Language
- Authors: Ankita Sontakke, Kanika Kalra, Manasi Patwardhan, Lovekesh Vig,
Raveendra Kumar Medicherla, Ravindra Naik, Shrishti Pradhan
- Abstract summary: We focus on transferring the knowledge acquired by the code-to-pseudocode neural model trained on a high resource PL (C++) using parallel code-pseudocode data.
We observe an improvement of 23.27% in the success rate of the generated C codes through back translation.
- Score: 13.716669765394293
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Generation of pseudo-code descriptions of legacy source code for software
maintenance is a manually intensive task. Recent encoder-decoder language
models have shown promise for automating pseudo-code generation for high
resource programming languages such as C++, but are heavily reliant on the
availability of a large code-pseudocode corpus. Soliciting such pseudocode
annotations for codes written in legacy programming languages (PL) is a time
consuming and costly affair requiring a thorough understanding of the source
PL. In this paper, we focus on transferring the knowledge acquired by the
code-to-pseudocode neural model trained on a high resource PL (C++) using
parallel code-pseudocode data. We aim to transfer this knowledge to a legacy PL
(C) with no PL-pseudocode parallel data for training. To achieve this, we
utilize an Iterative Back Translation (IBT) approach with a novel test-cases
based filtration strategy, to adapt the trained C++-to-pseudocode model to
C-to-pseudocode model. We observe an improvement of 23.27% in the success rate
of the generated C codes through back translation, over the successive IBT
iteration, illustrating the efficacy of our approach.
Related papers
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to fill the gap between programming languages and natural language.
Various experiments and ablations are done on four datasets including both the C++ and python languages to validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code
Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z) - StructCoder: Structure-Aware Transformer for Code Generation [13.797842927671846]
We introduce a structure-aware Transformer decoder that models both syntax and data flow to enhance the quality of generated code.
The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks.
arXiv Detail & Related papers (2022-06-10T17:26:31Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for
Code Understanding and Generation [36.47905744758698]
We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers.
Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning.
arXiv Detail & Related papers (2021-09-02T12:21:06Z) - Project-Level Encoding for Neural Source Code Summarization of
Subroutines [6.939768185086755]
We present a project-level encoder to improve models of code summarization.
We use that representation to augment the encoder of state-of-the-art neural code summarization techniques.
arXiv Detail & Related papers (2021-03-22T06:01:07Z) - InferCode: Self-Supervised Learning of Code Representations by
Predicting Subtrees [17.461451218469062]
This paper proposes InferCode to overcome the limitation by adapting the self-language learning mechanism to build source code model.
Subtrees in ASTs are treated with InferCode as the labels for training code representations without any human labeling effort or the overhead of expensive graph construction.
Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, ASTNN, higher performance results are achieved using our pre-trained InferCode model.
arXiv Detail & Related papers (2020-12-13T10:33:41Z) - Incorporating External Knowledge through Pre-training for Natural
Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.