Code Search based on Context-aware Code Translation
- URL: http://arxiv.org/abs/2202.08029v1
- Date: Wed, 16 Feb 2022 12:45:47 GMT
- Title: Code Search based on Context-aware Code Translation
- Authors: Weisong Sun and Chunrong Fang and Yuchen Chen and Guanhong Tao and
Tingxu Han and Quanjun Zhang
- Abstract summary: Existing techniques leverage deep learning models to construct embedding representations for code snippets and queries.
We propose a novel context-aware code translation technique that translates code snippets into natural language descriptions.
We evaluate the effectiveness of our technique, called TranCS, on the CodeSearchNet corpus with 1,000 queries.
- Score: 9.346066889885684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code search is a widely used technique by developers during software
development. It provides semantically similar implementations from a large code
corpus to developers based on their queries. Existing techniques leverage deep
learning models to construct embedding representations for code snippets and
queries, respectively. Features such as abstract syntax trees and control flow
graphs are commonly employed to represent the semantics of code snippets.
However, identical structure in these features does not necessarily imply
identical semantics of the code snippets, and vice versa. In addition, these
techniques use separate word mapping functions to map query words and code
tokens to embedding representations, which causes the same word/token to
receive divergent embeddings in queries and code snippets. We propose a novel
context-aware code translation technique that translates code snippets into
natural language descriptions (called translations). The code translation is
conducted on machine instructions, where the context information is collected
by simulating the execution of instructions. We further design a shared word
mapping function using one single vocabulary for generating embeddings for both
translations and queries. We evaluate the effectiveness of our technique,
called TranCS, on the CodeSearchNet corpus with 1,000 queries. Experimental
results show that TranCS significantly outperforms state-of-the-art techniques
by 49.31% to 66.50% in terms of MRR (mean reciprocal rank).
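As a rough illustration of two mechanisms the abstract describes (the shared word mapping function and the MRR metric), the sketch below is a toy, not TranCS's actual implementation: a single vocabulary and embedding table encode both queries and code translations, candidates are ranked by cosine similarity, and the ranking is scored with MRR. All names and data are hypothetical.

```python
import numpy as np

# One shared vocabulary and embedding table for BOTH queries and
# translations: a toy stand-in for TranCS's shared word mapping function.
vocab = {"<unk>": 0, "sort": 1, "a": 2, "list": 3, "in": 4,
         "ascending": 5, "order": 6, "reverse": 7, "string": 8}
rng = np.random.default_rng(0)
embed = rng.normal(size=(len(vocab), 8))

def encode(text: str) -> np.ndarray:
    """Mean-pool shared word embeddings; the SAME function serves both sides."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    return embed[ids].mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Hypothetical natural-language "translations" of two code snippets.
translations = ["sort list in ascending order", "reverse a string"]
query = "sort a list in ascending order"
q = encode(query)
ranking = sorted(range(len(translations)),
                 key=lambda i: cosine(q, encode(translations[i])),
                 reverse=True)
print(ranking)  # the sorting snippet (index 0) ranks first for this query

def mrr(first_hit_ranks):
    """Mean reciprocal rank: average of 1/rank of the first correct result."""
    return sum(1.0 / r for r in first_hit_ranks) / len(first_hit_ranks)

print(mrr([1, 2, 1]))  # queries hit at ranks 1, 2, 1 -> (1 + 0.5 + 1)/3 ≈ 0.83
```

Because queries and translations share one vocabulary, the same word receives the same embedding on both sides, which is the property the shared word mapping function is designed to guarantee.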
Related papers
- When simplicity meets effectiveness: Detecting code comments coherence with word embeddings and LSTM [6.417777780911223]
Code comments play a crucial role in software development, as they provide programmers with practical information.
Developers tend to leave comments unchanged after updating the code, resulting in a discrepancy between the two artifacts.
It is crucial to identify whether, given a code snippet, its corresponding comment is coherent and accurately reflects the intent behind the code.
arXiv Detail & Related papers (2024-05-25T15:21:27Z)
- Soft-Labeled Contrastive Pre-training for Function-level Code Representation [127.71430696347174]
We present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods.
Considering the relevance between code snippets in a large-scale corpus, soft-labeled contrastive pre-training can obtain fine-grained soft labels.
SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
arXiv Detail & Related papers (2022-10-18T05:17:37Z)
- NS3: Neuro-Symbolic Semantic Code Search [33.583344165521645]
We use a Neural Module Network architecture to implement this idea.
We compare our model, NS3 (Neuro-Symbolic Semantic Search), to a number of baselines, including state-of-the-art semantic code retrieval methods.
We demonstrate that our approach results in more precise code retrieval, and we study the effectiveness of our modular design when handling compositional queries.
arXiv Detail & Related papers (2022-05-21T20:55:57Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of semantically similar code.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs (a generic sketch of such a contrastive objective appears after this list).
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- Multimodal Representation for Neural Code Search [18.371048875103497]
We introduce tree-serialization methods on a simplified form of the AST and build a multimodal representation for the code data (a toy pre-order serialization is sketched after this list).
Our results show that both our tree-serialized representations and multimodal learning model improve the performance of neural code search.
arXiv Detail & Related papers (2021-07-02T12:08:19Z)
- Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose DGMS, an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code.
We use data flow in the pre-training stage, a semantic-level structure of code that encodes the "where-the-value-comes-from" relation between variables (a toy extraction of this relation is sketched after this list).
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
- Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations [28.61567319928316]
Corder is a self-supervised contrastive learning framework for source code models.
The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets.
We have shown that the code models pretrained by Corder substantially outperform the other baselines for code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks.
arXiv Detail & Related papers (2020-09-06T13:31:16Z)
- Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent [1.1168121941015012]
We study how code retrieval systems can be improved by leveraging descriptions to better capture the intents of code snippets.
Building on recent progress in transfer learning and natural language processing, we create a domain-specific retrieval model for code annotated with a natural language description.
arXiv Detail & Related papers (2020-08-27T15:39:09Z)
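Several of the entries above (SCodeR, the multimodal contrastive approach, CodeRetriever, and Corder) rely on a contrastive objective over aligned code-text or code-code pairs. The sketch below is a generic in-batch InfoNCE loss, not any one paper's exact formulation; the encoders are replaced by toy random vectors.

```python
import numpy as np

def info_nce(code_vecs: np.ndarray, text_vecs: np.ndarray, tau: float = 0.07):
    """Symmetric in-batch InfoNCE: row i of each matrix is a positive pair,
    and every other row in the batch serves as a negative."""
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    logits = c @ t.T / tau                       # pairwise cosine / temperature

    def ce_diag(m: np.ndarray) -> float:         # cross-entropy, diagonal labels
        log_p = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return float(-np.diag(log_p).mean())

    return (ce_diag(logits) + ce_diag(logits.T)) / 2

# Toy batch: 4 code/text pairs with 16-dim "encoder" outputs.
rng = np.random.default_rng(1)
code = rng.normal(size=(4, 16))
text = code + 0.1 * rng.normal(size=(4, 16))     # each text lies near its code
print(info_nce(code, text))                      # low loss: pairs are aligned
```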
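The tree-serialization idea in the Multimodal Representation entry can be approximated with a pre-order walk over Python's built-in AST; this is a toy version, and the paper's exact simplification scheme is not reproduced here.

```python
import ast

def serialize(node: ast.AST) -> list[str]:
    """Pre-order serialization of an AST into a flat token sequence."""
    tokens = [type(node).__name__]
    if isinstance(node, ast.Name):     # keep identifiers alongside node types
        tokens.append(node.id)
    for child in ast.iter_child_nodes(node):
        tokens += serialize(child)
    return tokens

print(serialize(ast.parse("b = a + 2")))
# ['Module', 'Assign', 'Name', 'b', 'Store', 'BinOp',
#  'Name', 'a', 'Load', 'Add', 'Constant']
```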
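GraphCodeBERT's "where-the-value-comes-from" relation can likewise be made concrete with a toy extractor over straight-line assignments; the snippet below is a rough stand-in built on Python's ast module, not GraphCodeBERT's actual data-flow construction.

```python
import ast

def value_comes_from(source: str) -> list[tuple[str, str]]:
    """For each assignment, emit (assigned_name, read_name) edges: a crude
    approximation of the "where-the-value-comes-from" relation."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            reads = [n.id for n in ast.walk(node.value)
                     if isinstance(n, ast.Name)]
            for target in node.targets:
                if isinstance(target, ast.Name):
                    edges += [(target.id, src) for src in reads]
    return edges

print(value_comes_from("a = 1\nb = a + 2\nc = a * b"))
# [('b', 'a'), ('c', 'a'), ('c', 'b')]
```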