Search4Code: Code Search Intent Classification Using Weak Supervision
- URL: http://arxiv.org/abs/2011.11950v3
- Date: Sat, 20 Mar 2021 15:01:41 GMT
- Title: Search4Code: Code Search Intent Classification Using Weak Supervision
- Authors: Nikitha Rao, Chetan Bansal and Joe Guan
- Abstract summary: We propose a weak supervision based approach for detecting code search intent in search queries for C# and Java programming languages.
We evaluate the approach against several baselines on a real-world dataset comprised of over 1 million queries mined from Bing web search engine.
We are also releasing Search4Code, the first large-scale real-world dataset of code search queries mined from Bing web search engine.
- Score: 5.441318460204245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developers use search for various tasks such as finding code, documentation,
debugging information, etc. In particular, web search is heavily used by
developers for finding code examples and snippets during the coding process.
Recently, natural language based code search has been an active area of
research. However, the lack of real-world large-scale datasets is a significant
bottleneck. In this work, we propose a weak supervision based approach for
detecting code search intent in search queries for C# and Java programming
languages. We evaluate the approach against several baselines on a real-world
dataset comprised of over 1 million queries mined from Bing web search engine
and show that the CNN based model can achieve an accuracy of 77% and 76% for C#
and Java respectively. Furthermore, we are also releasing Search4Code, the
first large-scale real-world dataset of code search queries mined from Bing web
search engine. We hope that the dataset will aid future research on code
search.
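The weak supervision idea named in the abstract can be illustrated with a minimal sketch. The abstract does not describe the paper's labeling heuristics, so everything here is assumed for illustration: the three labeling functions, their keyword lists, and the majority-vote aggregation are hypothetical and are not the authors' method. The sketch only shows the general pattern of combining noisy heuristic votes into weak labels that could later be used to train a classifier such as a CNN.

```python
from collections import Counter

# Label constants (hypothetical naming, not from the paper).
CODE_INTENT, NOT_CODE_INTENT, ABSTAIN = 1, 0, -1


def lf_how_to(query):
    # Assumed heuristic: "how to ..." queries often ask for code.
    return CODE_INTENT if query.lower().startswith("how to") else ABSTAIN


def lf_language_and_cue(query):
    # Assumed heuristic: a language name plus a word like "example".
    tokens = set(query.lower().split())
    has_language = bool(tokens & {"c#", "java"})
    has_cue = bool(tokens & {"example", "snippet", "code"})
    return CODE_INTENT if has_language and has_cue else ABSTAIN


def lf_non_code_cue(query):
    # Assumed heuristic: these words suggest a non-code information need.
    tokens = set(query.lower().split())
    if tokens & {"download", "install", "tutorial", "jobs"}:
        return NOT_CODE_INTENT
    return ABSTAIN


LABELING_FUNCTIONS = [lf_how_to, lf_language_and_cue, lf_non_code_cue]


def weak_label(query):
    """Majority vote over non-abstaining labeling functions; ties abstain."""
    votes = [lf(query) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


if __name__ == "__main__":
    for q in ["java read file example", "download java", "weather today"]:
        print(q, "->", weak_label(q))
```

Queries on which every function abstains, or on which the votes tie, stay unlabeled; in a setup like this they would simply be left out of the weakly labeled training set.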
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z)
- RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [65.5353313491402]
We introduce RethinkMCTS, which employs the Monte Carlo Tree Search (MCTS) algorithm to conduct thought-level searches before generating code.
We construct verbal feedback from fine-grained code execution feedback to refine erroneous thoughts during the search.
We demonstrate that RethinkMCTS outperforms previous search-based and feedback-based code generation baselines.
arXiv Detail & Related papers (2024-09-15T02:07:28Z)
- Searching by Code: a New SearchBySnippet Dataset and SnippeR Retrieval Model for Searching by Code Snippets [7.948526577271158]
We argue that using a code snippet as a query while looking for bugfixing instructions and code samples is a natural use case not covered by prior art.
We present a new SearchBySnippet dataset implementing the search-by-code use case based on StackOverflow data.
arXiv Detail & Related papers (2023-05-19T12:09:30Z)
- Survey of Code Search Based on Deep Learning [11.94599964179766]
This survey focuses on code search, that is, to retrieve code that matches a given query.
Deep learning, being able to extract complex semantics information, has achieved great success in this field.
We propose a new taxonomy to illustrate the state-of-the-art deep learning-based code search.
arXiv Detail & Related papers (2023-05-10T08:07:04Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Accelerating Code Search with Deep Hashing and Code Classification [64.3543949306799]
Code search retrieves reusable code snippets from a source code corpus based on natural language queries.
We propose a novel method CoSHC to accelerate code search with deep hashing and code classification.
arXiv Detail & Related papers (2022-03-29T07:05:30Z)
- Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus [17.6095840480926]
We propose a novel deep semantic model which makes use of the utilities of multi-modal sources.
We apply the proposed model to tackle the CodeSearchNet challenge about semantic code search.
Our model is trained on the CodeSearchNet corpus and evaluated on held-out data; the final model achieves 0.384 NDCG and won first place in this benchmark.
arXiv Detail & Related papers (2022-01-27T04:15:59Z)
- BERT2Code: Can Pretrained Language Models be Leveraged for Code Search? [0.7953229555481884]
We show that our model learns the inherent relationship between the embedding spaces and further probes into the scope of improvement.
In this analysis, we show that the quality of the code embedding model is the bottleneck for our model's performance.
arXiv Detail & Related papers (2021-04-16T10:28:27Z)
- Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets as unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.