Neuro-symbolic Zero-Shot Code Cloning with Cross-Language Intermediate
Representation
- URL: http://arxiv.org/abs/2304.13350v1
- Date: Wed, 26 Apr 2023 07:41:26 GMT
- Title: Neuro-symbolic Zero-Shot Code Cloning with Cross-Language Intermediate
Representation
- Authors: Krishnam Hasija, Shrishti Pradhan, Manasi Patwardhan, Raveendra Kumar
Medicherla, Lovekesh Vig, Ravindra Naik
- Abstract summary: We define a neuro-symbolic approach to address the task of finding semantically similar clones for the codes of the legacy programming language COBOL, without training data.
We fine-tune UniXcoder, the best-performing model for cross-programming language code search, for the Code Cloning task with the SBT IRs of C code-pairs, available in the CodeNet dataset.
With this fine-tuned UniXcoder, we get a performance improvement of 12.85 MAP@2 over the pre-trained UniXcoder model, in a zero-shot setting, on the COBOL test split synthesized from the CodeNet dataset.
- Score: 13.881954273779403
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In this paper, we define a neuro-symbolic approach to address the task of
finding semantically similar clones for the codes of the legacy programming
language COBOL, without training data. We define a meta-model that is
instantiated to have an Intermediate Representation (IR) in the form of
Abstract Syntax Trees (ASTs) common across codes in C and COBOL. We linearize
the IRs using Structure Based Traversal (SBT) to create sequential inputs. We
further fine-tune UniXcoder, the best-performing model for zero-shot
cross-programming language code search, for the Code Cloning task with the SBT
IRs of C code-pairs, available in the CodeNet dataset. This allows us to learn
latent representations for the IRs of the C codes, which are transferable to
the IRs of the COBOL codes. With this fine-tuned UniXcoder, we get a
performance improvement of 12.85 MAP@2 over the pre-trained UniXcoder model, in
a zero-shot setting, on the COBOL test split synthesized from the CodeNet
dataset. This demonstrates the efficacy of our meta-model based approach to
facilitate cross-programming language transfer.
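To make the linearization step concrete, below is a minimal Python sketch of Structure-Based Traversal (SBT) over a generic IR node; the `IRNode` type and its label strings are illustrative assumptions, not the paper's actual meta-model schema.

```python
# Minimal sketch of Structure-Based Traversal (SBT) linearization, assuming a
# generic node type for the cross-language IR. `IRNode` and the label strings
# are illustrative; the paper's actual meta-model schema is not shown here.
from dataclasses import dataclass, field
from typing import List

@dataclass
class IRNode:
    label: str                       # e.g. "assign", "binary_op:+", "identifier:x"
    children: List["IRNode"] = field(default_factory=list)

def sbt(node: IRNode) -> str:
    """Linearize a subtree so that the bracketing preserves the tree structure:
    every subtree becomes "( <label> <children...> ) <label>"."""
    if not node.children:
        return f"( {node.label} ) {node.label}"
    inner = " ".join(sbt(child) for child in node.children)
    return f"( {node.label} {inner} ) {node.label}"

# Toy IR for `x = a + b` (labels are made up for illustration).
tree = IRNode("assign", [
    IRNode("identifier:x"),
    IRNode("binary_op:+", [IRNode("identifier:a"), IRNode("identifier:b")]),
])
print(sbt(tree))
```

For `x = a + b` this yields a bracketed token sequence whose nesting mirrors the tree, which can then be fed to the encoder as plain text.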
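On the evaluation side, a hedged sketch of retrieval-style clone scoring with MAP@2 follows: embed the SBT-linearized IRs, rank candidates by cosine similarity, and average the per-query precision. The random vectors stand in for encoder outputs (e.g. a fine-tuned UniXcoder over SBT inputs), and the exact pooling and MAP@k protocol used in the paper are assumptions here.

```python
# Hedged sketch of the zero-shot evaluation loop: rank candidate IRs by cosine
# similarity of their embeddings and compute MAP@2. The random vectors below
# stand in for encoder outputs; the exact pooling and MAP@k protocol used in
# the paper are assumptions.
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def average_precision_at_k(ranked_rel, k):
    """AP@k for one query; ranked_rel[i] is 1 if the i-th retrieved candidate
    is a true clone of the query, else 0."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_rel[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

def map_at_k(query_emb, cand_emb, relevance, k=2):
    """relevance[q][c] = 1 if candidate c is a clone of query q."""
    sims = cosine_sim(query_emb, cand_emb)      # (num_queries, num_candidates)
    ranking = np.argsort(-sims, axis=1)         # most similar candidate first
    aps = [average_precision_at_k([relevance[q][c] for c in ranking[q]], k)
           for q in range(sims.shape[0])]
    return float(np.mean(aps))

# Toy usage with random embeddings in place of real encoder outputs.
rng = np.random.default_rng(0)
queries, candidates = rng.normal(size=(3, 8)), rng.normal(size=(5, 8))
relevance = rng.integers(0, 2, size=(3, 5)).tolist()
print(f"MAP@2 = {map_at_k(queries, candidates, relevance, k=2):.4f}")
```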
Related papers
- Large Language Models for cross-language code clone detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction with the software engineering community.
Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.
arXiv Detail & Related papers (2024-08-08T12:57:14Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- LILO: Learning Interpretable Libraries by Compressing and Documenting Code [71.55208585024198]
We introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code.
LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch.
We find that AutoDoc, LILO's auto-documentation procedure, boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions.
arXiv Detail & Related papers (2023-10-30T17:55:02Z)
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion [56.813974784131624]
LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens.
Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction, while memory tokens are included to highlight important statements that may be invoked later and need to be memorized.
arXiv Detail & Related papers (2023-06-26T17:59:24Z)
- Knowledge Transfer for Pseudo-code Generation from Low Resource Programming Language [13.716669765394293]
We focus on transferring the knowledge acquired by the code-to-pseudocode neural model trained on a high resource PL (C++) using parallel code-pseudocode data.
We observe an improvement of 23.27% in the success rate of the generated C codes through back translation.
arXiv Detail & Related papers (2023-03-16T03:38:08Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method to transform an AST into a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
- Synchromesh: Reliable code generation from pre-trained language models [38.15391794443022]
We propose Synchromesh: a framework for substantially improving the reliability of pre-trained models for code generation.
First, it retrieves few-shot examples from a training bank using Target Similarity Tuning (TST), a novel method for semantic example selection.
Then, Synchromesh feeds the examples to a pre-trained language model and samples programs using Constrained Semantic Decoding (CSD), a general framework for constraining the output to a set of valid programs in the target language.
arXiv Detail & Related papers (2022-01-26T22:57:44Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.