Neural Code Search Revisited: Enhancing Code Snippet Retrieval through
Natural Language Intent
- URL: http://arxiv.org/abs/2008.12193v1
- Date: Thu, 27 Aug 2020 15:39:09 GMT
- Title: Neural Code Search Revisited: Enhancing Code Snippet Retrieval through
Natural Language Intent
- Authors: Geert Heyman and Tom Van Cutsem
- Abstract summary: We study how code retrieval systems can be improved by leveraging descriptions to better capture the intents of code snippets.
Building on recent progress in transfer learning and natural language processing, we create a domain-specific retrieval model for code annotated with a natural language description.
- Score: 1.1168121941015012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose and study annotated code search: the retrieval of
code snippets paired with brief descriptions of their intent using natural
language queries. On three benchmark datasets, we investigate how code
retrieval systems can be improved by leveraging descriptions to better capture
the intents of code snippets. Building on recent progress in transfer learning
and natural language processing, we create a domain-specific retrieval model
for code annotated with a natural language description. We find that our model
yields significantly more relevant search results (with absolute gains up to
20.6% in mean reciprocal rank) compared to state-of-the-art code retrieval
methods that do not use descriptions but attempt to compute the intent of
snippets solely from unannotated code.
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z) - Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search [7.822427053078387]
Generation-Augmented Retrieval (GAR) framework generates exemplar code snippets to augment queries.
We propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the for style normalization.
Code Style Similarity is the first metric tailored to quantify stylistic similarities in code.
arXiv Detail & Related papers (2024-01-09T12:12:50Z) - Generation-Augmented Query Expansion For Code Retrieval [51.20943646688115]
We propose a generation-augmented query expansion framework.
Inspired by the human retrieval process - sketching an answer before searching.
We achieve new state-of-the-art results on the CodeSearchNet benchmark.
arXiv Detail & Related papers (2022-12-20T23:49:37Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus [17.6095840480926]
We propose a novel deep semantic model which makes use of the utilities of multi-modal sources.
We apply the proposed model to tackle the CodeSearchNet challenge about semantic code search.
Our model is trained on CodeSearchNet corpus and evaluated on the held-out data, the final model achieves 0.384 NDCG and won the first place in this benchmark.
arXiv Detail & Related papers (2022-01-27T04:15:59Z) - CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines the unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z) - AugmentedCode: Examining the Effects of Natural Language Resources in
Code Retrieval Models [5.112140303263898]
We introduce Augmented Code (AugmentedCode) retrieval which takes advantage of existing information within the code.
We showcased the the results of augmented programming language which outperforms on CodeSearchNet and CodeBERT with a Mean Reciprocal Rank (MRR) of 0.73 and 0.96.
arXiv Detail & Related papers (2021-10-16T08:44:48Z) - BERT2Code: Can Pretrained Language Models be Leveraged for Code Search? [0.7953229555481884]
We show that our model learns the inherent relationship between the embedding spaces and further probes into the scope of improvement.
In this analysis, we show that the quality of the code embedding model is the bottleneck for our model's performance.
arXiv Detail & Related papers (2021-04-16T10:28:27Z) - Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.