Evaluating and Optimizing the Effectiveness of Neural Machine Translation in Supporting Code Retrieval Models: A Study on the CAT Benchmark
- URL: http://arxiv.org/abs/2308.04693v1
- Date: Wed, 9 Aug 2023 04:06:24 GMT
- Title: Evaluating and Optimizing the Effectiveness of Neural Machine Translation in Supporting Code Retrieval Models: A Study on the CAT Benchmark
- Authors: Hung Phan and Ali Jannesari
- Abstract summary: We analyze the performance of NMT in natural language-to-code translation in the newly curated CAT benchmark.
We propose ASTTrans Representation, a tailored representation of an Abstract Syntax Tree (AST) using a subset of non-terminal nodes.
Our NMT models trained on the ASTTrans Representation can boost the Mean Reciprocal Rank of state-of-the-art code search models by up to 3.08%.
- Score: 8.3017581766084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural Machine Translation (NMT) is widely applied in software engineering
tasks. The effectiveness of NMT for code retrieval relies on its ability to
learn the mapping from the sequence of tokens in the source language to the
sequence of tokens in the target language. While NMT performs well in
pseudocode-to-code translation, it can struggle to translate from natural
language queries to source code in newly curated, real-world code
documentation/implementation datasets. In this work, we analyze the performance
of NMT in natural language-to-code translation on the newly curated CAT
benchmark, which includes optimized versions of three Java datasets (TLCodeSum,
CodeSearchNet, and Funcom) and a Python dataset (PCSD). Our evaluation shows
that NMT has low accuracy on this task, as measured by the CrystalBLEU and
Meteor metrics. To reduce the burden on NMT of learning complex representations
of source code, we propose the ASTTrans Representation, a tailored
representation of an Abstract Syntax Tree (AST) that uses a subset of
non-terminal nodes. We show that classical NMT performs significantly better
when learning the ASTTrans Representation than when learning raw code tokens,
with up to a 36% improvement in Meteor score. Moreover, we leverage the
ASTTrans Representation to build combined code search pipelines on top of the
state-of-the-art code search models GraphCodeBERT and UniXcoder. Our NMT models
trained on the ASTTrans Representation boost the Mean Reciprocal Rank of these
code search pipelines by up to 3.08% and improve the results of 23.08% of
queries on the CAT benchmark.
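To make these ideas concrete, here is a minimal Python sketch of an ASTTrans-like representation and of the Mean Reciprocal Rank metric. The kept node subset, serialization, and function names below are illustrative assumptions, not the paper's exact definitions.

```python
import ast

# Hypothetical subset of non-terminal node types to keep; the actual
# ASTTrans selection is an assumption here, not taken from the paper.
KEPT_NONTERMINALS = {"FunctionDef", "If", "For", "While", "Return", "Call", "Assign"}

def asttrans_like_repr(source: str) -> list[str]:
    """Walk the AST and keep only the chosen non-terminal node labels."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)
            if type(node).__name__ in KEPT_NONTERMINALS]

def mean_reciprocal_rank(first_hit_ranks: list[int]) -> float:
    """MRR = (1/|Q|) * sum(1/rank_i), where rank_i is the 1-indexed
    position of the first correct result for query i."""
    return sum(1.0 / r for r in first_hit_ranks) / len(first_hit_ranks)

print(asttrans_like_repr("def add(a, b):\n    return a + b\n"))  # ['FunctionDef', 'Return']
print(mean_reciprocal_rank([1, 2, 4]))  # ~0.583
```

The intuition is that a pruned non-terminal sequence is shorter and more regular than raw code tokens, which is what makes it easier for a sequence-to-sequence model to learn.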
Related papers
- VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search [5.389248707675898]
Large Language Models (LLMs) can generate useful code, but often the code they generate cannot be trusted to be sound.
We present VerMCTS, an approach to begin to resolve this issue by generating verified programs in Dafny and Coq.
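As a loose illustration of verifier-in-the-loop program search in this spirit (shown here as simple depth-first backtracking rather than the paper's MCTS, with placeholder llm_extend/verify_partial interfaces):

```python
from typing import Callable, Iterable, Optional

def verified_search(partial: str,
                    llm_extend: Callable[[str], Iterable[str]],
                    verify_partial: Callable[[str], str],
                    depth: int = 0, max_depth: int = 5) -> Optional[str]:
    """Extend a partial program with LLM proposals, pruning branches the
    verifier rejects; return the first fully verified program found."""
    if depth > max_depth:
        return None
    for candidate in llm_extend(partial):       # proposals from a language model
        status = verify_partial(candidate)      # 'complete', 'promising', or 'failed'
        if status == "complete":
            return candidate
        if status == "promising":
            result = verified_search(candidate, llm_extend, verify_partial,
                                     depth + 1, max_depth)
            if result is not None:
                return result
    return None                                 # backtrack

# Tiny demo with stub implementations (purely illustrative).
demo_llm = lambda p: [p + "x"]
demo_verify = lambda c: "complete" if len(c) >= 3 else "promising"
print(verified_search("", demo_llm, demo_verify))  # "xxx"
```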
arXiv Detail & Related papers (2024-02-13T00:55:14Z)
- Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We? [23.52632194060246]
Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering.
The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning.
We compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks.
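As a hedged, self-contained contrast between the two representations compared above (the snippet and serializations are illustrative, not the paper's setup):

```python
import ast
import io
import tokenize

src = "def square(x):\n    return x * x\n"

# Token view: the flat lexical sequence a Token-based model consumes.
tokens = [tok.string for tok in tokenize.generate_tokens(io.StringIO(src).readline)
          if tok.string.strip()]

# AST view: node types from a tree walk, exposing syntax the token stream hides.
nodes = [type(n).__name__ for n in ast.walk(ast.parse(src))]

print(tokens)  # ['def', 'square', '(', 'x', ')', ':', 'return', 'x', '*', 'x']
print(nodes)   # ['Module', 'FunctionDef', 'arguments', 'Return', 'arg', 'BinOp', ...]
```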
arXiv Detail & Related papers (2023-12-01T08:37:27Z)
- Neural Machine Translation for Code Generation [0.7607163273993514]
In NMT for code generation, the task is to generate source code that satisfies constraints expressed in the input.
In this paper we survey the NMT for code generation literature, cataloging the variety of methods that have been explored.
We discuss the limitations of existing methods and future research directions.
arXiv Detail & Related papers (2023-05-22T21:43:12Z)
- Learning Homographic Disambiguation Representation for Neural Machine Translation [20.242134720005467]
Homographs, words with the same spelling but different meanings, remain challenging in Neural Machine Translation (NMT).
We propose a novel approach to tackle these NMT issues in the latent space.
We first train an encoder (aka "homographic encoder") to learn universal sentence representations on a natural language inference (NLI) task.
We further fine-tune the encoder using homograph-based synsets from WordNet, enabling it to learn word-set representations from sentences.
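A small assumed illustration of pulling a homograph's sense inventory from WordNet with NLTK; the paper's actual synset construction is not detailed in this summary:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus fetch

# "bank" is a classic homograph: a financial institution vs. a river edge.
for synset in wn.synsets("bank")[:4]:
    print(synset.name(), "->", synset.definition())
```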
arXiv Detail & Related papers (2023-04-12T13:42:59Z)
- Quality-Aware Decoding for Neural Machine Translation [64.24934199944875]
We propose quality-aware decoding for neural machine translation (NMT).
We leverage recent breakthroughs in reference-free and reference-based MT evaluation through various inference methods.
We find that quality-aware decoding consistently outperforms MAP-based decoding according both to state-of-the-art automatic metrics and to human assessments.
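A hedged sketch of quality-aware decoding as N-best reranking: instead of keeping the single highest-probability (MAP) hypothesis, candidates are rescored by a quality estimator. The scorer below is a toy stand-in for the learned MT metrics the paper leverages:

```python
from typing import Callable

def quality_aware_decode(candidates: list[str],
                         quality_fn: Callable[[str], float]) -> str:
    """Return the candidate translation preferred by the quality estimator,
    ignoring the model's own probabilities."""
    return max(candidates, key=quality_fn)

hypotheses = ["the cat sits on the mat", "a cat sit on mat"]
reference = "the cat sits on the mat"
# Toy reference-based scorer standing in for a learned metric.
overlap = lambda h: len(set(h.split()) & set(reference.split()))
print(quality_aware_decode(hypotheses, quality_fn=overlap))  # "the cat sits on the mat"
```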
arXiv Detail & Related papers (2022-05-02T15:26:28Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
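One common instantiation of contrastive learning for code search is an in-batch InfoNCE loss over paired query/code embeddings; this is a generic sketch under that assumption, not necessarily the paper's exact loss or augmentation:

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, code_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """In-batch contrastive loss: (query_i, code_i) are positives,
    every other code in the batch is a negative for query_i."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature           # (B, B) cosine similarities
    labels = torch.arange(q.size(0))         # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))  # toy batch of 8 pairs
```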
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
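A toy contrast between the two input corruptions discussed above: MLM-style masking produces visibly corrupted inputs, while reordering keeps the input looking like a real, full sentence. The corruption rate and functions are assumptions:

```python
import random

def mask_words(words: list[str], rate: float = 0.3, seed: int = 0) -> list[str]:
    """MLM-style input: some tokens replaced with a mask symbol."""
    rng = random.Random(seed)
    return [w if rng.random() > rate else "<mask>" for w in words]

def shuffle_words(words: list[str], seed: int = 0) -> list[str]:
    """Reordered input: same tokens, so it still resembles a full sentence."""
    out = list(words)
    random.Random(seed).shuffle(out)
    return out

sent = "the decoder reconstructs the original sentence".split()
print(mask_words(sent))     # some tokens replaced by '<mask>'
print(shuffle_words(sent))  # same tokens in scrambled order
```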
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Exploiting Neural Query Translation into Cross Lingual Information Retrieval [49.167049709403166]
Existing CLIR systems mainly exploit statistical machine translation (SMT) rather than the more advanced neural machine translation (NMT).
We propose a novel data augmentation method that extracts query translation pairs according to user clickthrough data.
Experimental results reveal that the proposed approach yields better retrieval quality than strong baselines.
arXiv Detail & Related papers (2020-10-26T15:28:19Z)
- Encodings of Source Syntax: Similarities in NMT Representations Across Target Languages [3.464656011246703]
We find that NMT encoders learn similar source syntax regardless of NMT target language.
NMT encoders outperform RNNs trained directly on several of the constituent label prediction tasks.
arXiv Detail & Related papers (2020-05-17T06:41:32Z)
- Neural Machine Translation: Challenges, Progress and Future [62.75523637241876]
Machine translation (MT) is a technique that leverages computers to translate human languages automatically.
Neural machine translation (NMT) models the direct mapping between source and target languages with deep neural networks.
This article reviews the NMT framework, discusses the challenges in NMT, and introduces some exciting recent progress.
arXiv Detail & Related papers (2020-04-13T07:53:57Z)
- Explicit Reordering for Neural Machine Translation [50.70683739103066]
In Transformer-based neural machine translation (NMT), the positional encoding mechanism helps the self-attention networks to learn the source representation with order dependency.
We propose a novel reordering method to explicitly model this reordering information for the Transformer-based NMT.
The empirical results on the WMT14 English-to-German, WAT ASPEC Japanese-to-English, and WMT17 Chinese-to-English translation tasks show the effectiveness of the proposed approach.
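For reference, this is the standard sinusoidal positional encoding that gives Transformer self-attention its order information (the Vaswani et al. formulation; the paper's added reordering module is not reproduced here):

```python
import math

def positional_encoding(pos: int, d_model: int) -> list[float]:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same angle)."""
    return [math.sin(pos / 10000 ** (2 * (i // 2) / d_model)) if i % 2 == 0
            else math.cos(pos / 10000 ** (2 * (i // 2) / d_model))
            for i in range(d_model)]

print([round(v, 3) for v in positional_encoding(1, 8)])
```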
arXiv Detail & Related papers (2020-04-08T05:28:46Z)