Evaluating and Optimizing the Effectiveness of Neural Machine Translation in Supporting Code Retrieval Models: A Study on the CAT Benchmark
- URL: http://arxiv.org/abs/2308.04693v1
- Date: Wed, 9 Aug 2023 04:06:24 GMT
- Title: Evaluating and Optimizing the Effectiveness of Neural Machine Translation in Supporting Code Retrieval Models: A Study on the CAT Benchmark
- Authors: Hung Phan and Ali Jannesari
- Abstract summary: We analyze the performance of NMT in natural language-to-code translation in the newly curated CAT benchmark.
We propose ASTTrans Representation, a tailored representation of an Abstract Syntax Tree (AST) using a subset of non-terminal nodes.
Our NMT models trained on the ASTTrans Representation can boost the Mean Reciprocal Rank of state-of-the-art code search models by up to 3.08%.
- Score: 8.3017581766084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural Machine Translation (NMT) is widely applied in software engineering
tasks. The effectiveness of NMT for code retrieval relies on its ability to
learn the mapping from the sequence of tokens in the source language to the
sequence of tokens in the target language. While NMT performs well in
pseudocode-to-code translation, it can struggle to translate from natural
language queries to source code in newly curated, real-world code
documentation/implementation datasets. In this work, we analyze the performance
of NMT in natural language-to-code translation on the newly curated CAT
benchmark, which includes optimized versions of three Java datasets (TLCodeSum,
CodeSearchNet, and Funcom) and a Python dataset (PCSD). Our evaluation shows
that NMT has low accuracy on this task, as measured by the CrystalBLEU and
Meteor metrics. To reduce the burden on NMT of learning complex representations
of source code, we propose the ASTTrans Representation, a tailored
representation of an Abstract Syntax Tree (AST) that uses a subset of
non-terminal nodes. We show that classical NMT performs significantly better
when learning the ASTTrans Representation than when learning raw code tokens,
with up to a 36% improvement in Meteor score. Moreover, we leverage the
ASTTrans Representation to build combined code search pipelines on top of the
state-of-the-art code search models GraphCodeBERT and UniXcoder. Our NMT models
trained on the ASTTrans Representation boost the Mean Reciprocal Rank of these
code search pipelines by up to 3.08% and improve the results of 23.08% of
queries on the CAT benchmark.
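To make these ideas concrete, here is a minimal Python sketch of an ASTTrans-like representation and of the Mean Reciprocal Rank metric. The kept node subset, serialization, and function names below are illustrative assumptions, not the paper's exact definitions.

```python
import ast

# Hypothetical subset of non-terminal node types to keep; the actual
# ASTTrans selection is an assumption here, not taken from the paper.
KEPT_NONTERMINALS = {"FunctionDef", "If", "For", "While", "Return", "Call", "Assign"}

def asttrans_like_repr(source: str) -> list[str]:
    """Walk the AST and keep only the chosen non-terminal node labels."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)
            if type(node).__name__ in KEPT_NONTERMINALS]

def mean_reciprocal_rank(first_hit_ranks: list[int]) -> float:
    """MRR = (1/|Q|) * sum(1/rank_i), where rank_i is the 1-indexed
    position of the first correct result for query i."""
    return sum(1.0 / r for r in first_hit_ranks) / len(first_hit_ranks)

print(asttrans_like_repr("def add(a, b):\n    return a + b\n"))  # ['FunctionDef', 'Return']
print(mean_reciprocal_rank([1, 2, 4]))  # ~0.583
```

The intuition is that a pruned non-terminal sequence is shorter and more regular than raw code tokens, which is what makes it easier for a sequence-to-sequence model to learn.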
Related papers
- VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search [5.389248707675898]
Large Language Models (LLMs) can generate useful code, but often the code they generate cannot be trusted to be sound.
We present VerMCTS, an approach to begin to resolve this issue by generating verified programs in Dafny and Coq.
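As a loose illustration of verifier-in-the-loop program search in this spirit (shown here as simple depth-first backtracking rather than the paper's MCTS, with placeholder llm_extend/verify_partial interfaces):

```python
from typing import Callable, Iterable, Optional

def verified_search(partial: str,
                    llm_extend: Callable[[str], Iterable[str]],
                    verify_partial: Callable[[str], str],
                    depth: int = 0, max_depth: int = 5) -> Optional[str]:
    """Extend a partial program with LLM proposals, pruning branches the
    verifier rejects; return the first fully verified program found."""
    if depth > max_depth:
        return None
    for candidate in llm_extend(partial):       # proposals from a language model
        status = verify_partial(candidate)      # 'complete', 'promising', or 'failed'
        if status == "complete":
            return candidate
        if status == "promising":
            result = verified_search(candidate, llm_extend, verify_partial,
                                     depth + 1, max_depth)
            if result is not None:
                return result
    return None                                 # backtrack

# Tiny demo with stub implementations (purely illustrative).
demo_llm = lambda p: [p + "x"]
demo_verify = lambda c: "complete" if len(c) >= 3 else "promising"
print(verified_search("", demo_llm, demo_verify))  # "xxx"
```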
arXiv Detail & Related papers (2024-02-13T00:55:14Z)
- Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We? [23.52632194060246]
Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering.
The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning.
We compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks.
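As a hedged, self-contained contrast between the two representations compared above (the snippet and serializations are illustrative, not the paper's setup):

```python
import ast
import io
import tokenize

src = "def square(x):\n    return x * x\n"

# Token view: the flat lexical sequence a Token-based model consumes.
tokens = [tok.string for tok in tokenize.generate_tokens(io.StringIO(src).readline)
          if tok.string.strip()]

# AST view: node types from a tree walk, exposing syntax the token stream hides.
nodes = [type(n).__name__ for n in ast.walk(ast.parse(src))]

print(tokens)  # ['def', 'square', '(', 'x', ')', ':', 'return', 'x', '*', 'x']
print(nodes)   # ['Module', 'FunctionDef', 'arguments', 'Return', 'arg', 'BinOp', ...]
```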
arXiv Detail & Related papers (2023-12-01T08:37:27Z)
- Neural Machine Translation for Code Generation [0.7607163273993514]
In NMT for code generation, the task is to generate source code that satisfies constraints expressed in the input.
In this paper we survey the NMT for code generation literature, cataloging the variety of methods that have been explored.
We discuss the limitations of existing methods and future research directions.
arXiv Detail & Related papers (2023-05-22T21:43:12Z)
- Learning Homographic Disambiguation Representation for Neural Machine Translation [20.242134720005467]
Homographs, words with the same spelling but different meanings, remain challenging in Neural Machine Translation (NMT).
We propose a novel approach to tackle these NMT issues in the latent space.
We first train an encoder (aka "homographic encoder") to learn universal sentence representations on a natural language inference (NLI) task.
We further fine-tune the encoder using homograph-based synsets from WordNet, enabling it to learn word-set representations from sentences.
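A small assumed illustration of pulling a homograph's sense inventory from WordNet with NLTK; the paper's actual synset construction is not detailed in this summary:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus fetch

# "bank" is a classic homograph: a financial institution vs. a river edge.
for synset in wn.synsets("bank")[:4]:
    print(synset.name(), "->", synset.definition())
```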
arXiv Detail & Related papers (2023-04-12T13:42:59Z)
- Quality-Aware Decoding for Neural Machine Translation [64.24934199944875]
We propose quality-aware decoding for neural machine translation (NMT).
We leverage recent breakthroughs in reference-free and reference-based MT evaluation through various inference methods.
We find that quality-aware decoding consistently outperforms MAP-based decoding according both to state-of-the-art automatic metrics and to human assessments.
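A hedged sketch of quality-aware decoding as N-best reranking: instead of keeping the single highest-probability (MAP) hypothesis, candidates are rescored by a quality estimator. The scorer below is a toy stand-in for the learned MT metrics the paper leverages:

```python
from typing import Callable

def quality_aware_decode(candidates: list[str],
                         quality_fn: Callable[[str], float]) -> str:
    """Return the candidate translation preferred by the quality estimator,
    ignoring the model's own probabilities."""
    return max(candidates, key=quality_fn)

hypotheses = ["the cat sits on the mat", "a cat sit on mat"]
reference = "the cat sits on the mat"
# Toy reference-based scorer standing in for a learned metric.
overlap = lambda h: len(set(h.split()) & set(reference.split()))
print(quality_aware_decode(hypotheses, quality_fn=overlap))  # "the cat sits on the mat"
```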
arXiv Detail & Related papers (2022-05-02T15:26:28Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
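One common instantiation of contrastive learning for code search is an in-batch InfoNCE loss over paired query/code embeddings; this is a generic sketch under that assumption, not necessarily the paper's exact loss or augmentation:

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, code_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """In-batch contrastive loss: (query_i, code_i) are positives,
    every other code in the batch is a negative for query_i."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature           # (B, B) cosine similarities
    labels = torch.arange(q.size(0))         # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))  # toy batch of 8 pairs
```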
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
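A toy contrast between the two input corruptions discussed above: MLM-style masking produces visibly corrupted inputs, while reordering keeps the input looking like a real, full sentence. The corruption rate and functions are assumptions:

```python
import random

def mask_words(words: list[str], rate: float = 0.3, seed: int = 0) -> list[str]:
    """MLM-style input: some tokens replaced with a mask symbol."""
    rng = random.Random(seed)
    return [w if rng.random() > rate else "<mask>" for w in words]

def shuffle_words(words: list[str], seed: int = 0) -> list[str]:
    """Reordered input: same tokens, so it still resembles a full sentence."""
    out = list(words)
    random.Random(seed).shuffle(out)
    return out

sent = "the decoder reconstructs the original sentence".split()
print(mask_words(sent))     # some tokens replaced by '<mask>'
print(shuffle_words(sent))  # same tokens in scrambled order
```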
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Exploiting Neural Query Translation into Cross Lingual Information Retrieval [49.167049709403166]
Existing CLIR systems mainly exploit statistical machine translation (SMT) rather than the more advanced neural machine translation (NMT).
We propose a novel data augmentation method that extracts query translation pairs according to user clickthrough data.
Experimental results reveal that the proposed approach yields better retrieval quality than strong baselines.
arXiv Detail & Related papers (2020-10-26T15:28:19Z)
- Encodings of Source Syntax: Similarities in NMT Representations Across Target Languages [3.464656011246703]
We find that NMT encoders learn similar source syntax regardless of NMT target language.
NMT encoders outperform RNNs trained directly on several of the constituent label prediction tasks.
arXiv Detail & Related papers (2020-05-17T06:41:32Z)
- Neural Machine Translation: Challenges, Progress and Future [62.75523637241876]
Machine translation (MT) is a technique that leverages computers to translate human languages automatically.
Neural machine translation (NMT) models the direct mapping between source and target languages with deep neural networks.
This article reviews the NMT framework, discusses the challenges in NMT, and introduces some exciting recent progress.
arXiv Detail & Related papers (2020-04-13T07:53:57Z)
- Explicit Reordering for Neural Machine Translation [50.70683739103066]
In Transformer-based neural machine translation (NMT), the positional encoding mechanism helps the self-attention networks to learn the source representation with order dependency.
We propose a novel reordering method to explicitly model this reordering information for the Transformer-based NMT.
The empirical results on the WMT14 English-to-German, WAT ASPEC Japanese-to-English, and WMT17 Chinese-to-English translation tasks show the effectiveness of the proposed approach.
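For reference, this is the standard sinusoidal positional encoding that gives Transformer self-attention its order information (the Vaswani et al. formulation; the paper's added reordering module is not reproduced here):

```python
import math

def positional_encoding(pos: int, d_model: int) -> list[float]:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same angle)."""
    return [math.sin(pos / 10000 ** (2 * (i // 2) / d_model)) if i % 2 == 0
            else math.cos(pos / 10000 ** (2 * (i // 2) / d_model))
            for i in range(d_model)]

print([round(v, 3) for v in positional_encoding(1, 8)])
```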
arXiv Detail & Related papers (2020-04-08T05:28:46Z)