Structural Code Search using Natural Language Queries
- URL: http://arxiv.org/abs/2507.02107v1
- Date: Wed, 02 Jul 2025 19:42:37 GMT
- Title: Structural Code Search using Natural Language Queries
- Authors: Ben Limpanukorn, Yanjun Wang, Zach Patterson, Pranav Garg, Murali Krishna Ramanathan, Xiaofei Ma, Anoop Deoras, Miryung Kim
- Abstract summary: We propose to allow developers to use natural language to search for code structurally. Expressing queries in natural language provides an intuitive way to search for code and lowers the barrier to entry. We show that our approach for structural code search, based on translating NL queries to DSL queries using an LLM, is effective and robust.
- Score: 14.915304679162713
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Searching code is a common task that developers perform to understand APIs, learn common code patterns, and navigate code. Currently, developers most commonly search using keywords and regular expressions, which are easy to use and widely available. Beyond keywords and regular expressions, structural code search tools allow developers to search for code based on its syntactic structure. This has numerous applications, ranging from bug finding to systematically refactoring code. However, these structural code search tools operate on queries expressed in domain-specific languages (DSLs) that can be difficult to learn and write. We propose to allow developers to use natural language to search for code structurally. Expressing queries in natural language provides an intuitive way to search for code and lowers the barrier to entry. In this work, we develop a novel general approach that combines the reasoning capabilities of an LLM, to interpret natural language search queries, with the power of structural search tools, to efficiently and accurately retrieve relevant code. We then instantiate this approach for two structural code search DSLs: Semgrep and GQL. In our evaluation, we construct a new benchmark for structural code search consisting of 400 queries over 10 Java projects. We show that our approach of translating NL queries to DSL queries using an LLM is effective and robust, achieving precision and recall ranging from 55% to 70%. Further, our approach significantly outperforms baselines based on semantic code search and LLM retrieval by up to 57% and 14%, respectively, in F1 score.
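To make the pipeline concrete, here is a minimal sketch of the NL-to-Semgrep flow the abstract describes, under stated assumptions: the `translate_nl_to_semgrep` helper and its hardcoded rule are hypothetical stand-ins for the LLM translation step (the paper's prompts are not reproduced here), and only the `semgrep --config ... --json` invocation reflects the real Semgrep CLI.

```python
import json
import subprocess
import tempfile

def translate_nl_to_semgrep(nl_query: str) -> str:
    """Hypothetical stand-in for the LLM translation step.

    A real system would prompt an LLM with the NL query and the Semgrep
    rule schema; here we return a fixed rule for one example query.
    """
    return """\
rules:
  - id: nl-translated-rule
    languages: [java]
    severity: INFO
    message: match for the natural language query
    pattern: System.out.println(...)
"""

def structural_search(nl_query: str, repo_path: str) -> list[dict]:
    """Translate an NL query to a Semgrep rule, then run the structural search."""
    rule_yaml = translate_nl_to_semgrep(nl_query)
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        f.write(rule_yaml)
        rule_file = f.name
    # Semgrep performs the exact syntactic matching and reports results as JSON.
    proc = subprocess.run(
        ["semgrep", "--config", rule_file, "--json", repo_path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)["results"]

if __name__ == "__main__":
    matches = structural_search("find all calls to System.out.println", "./my-java-project")
    for m in matches:
        print(m["path"], m["start"]["line"])
```

Because the matching itself is delegated to a structural search engine, retrieval stays exact and efficient even over large repositories; the LLM only has to produce one small rule per query.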
Related papers
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
- CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark; a generic sketch of this embedding-retrieval setup follows the entry.
arXiv Detail & Related papers (2024-11-19T16:54:45Z)
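For contrast with the structural approach above, the dense-retrieval setup that embedding models like CodeXEmbed target can be sketched generically. The `embed` function below is a random-projection placeholder, not CodeXEmbed's actual model or API; only the cosine-ranking logic is the point.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real system would call a model such as CodeXEmbed."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # deterministic per process
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

def retrieve(query: str, snippets: list[str], k: int = 3) -> list[str]:
    """Rank code snippets by cosine similarity to the query embedding."""
    q = embed(query)
    scores = [float(q @ embed(s)) for s in snippets]  # unit vectors: dot = cosine
    order = np.argsort(scores)[::-1][:k]
    return [snippets[i] for i in order]
```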
- Leveraging LLMs to Enable Natural Language Search on Go-to-market Platforms [0.23301643766310368]
We implement and evaluate a solution for the Zoominfo product for sellers, which prompts large language models with natural language queries.
The intermediary search fields offer numerous advantages for each query, including the elimination of syntax errors.
Comprehensive experiments with closed-source, open-source, and fine-tuned LLMs were conducted to demonstrate the efficacy of our approach.
arXiv Detail & Related papers (2024-11-07T03:58:38Z)
- Large Search Model: Redefining Search Stack in the Era of LLMs [63.503320030117145]
We introduce a novel conceptual framework called the large search model, which redefines the conventional search stack by unifying search tasks with one large language model (LLM).
All tasks are formulated as autoregressive text generation problems, allowing for the customization of tasks through the use of natural language prompts.
This proposed framework capitalizes on the strong language understanding and reasoning capabilities of LLMs, offering the potential to enhance search result quality while simultaneously simplifying the existing cumbersome search stack.
arXiv Detail & Related papers (2023-10-23T05:52:09Z)
- Searching by Code: a New SearchBySnippet Dataset and SnippeR Retrieval Model for Searching by Code Snippets [7.948526577271158]
We argue that using a code snippet as a query while looking for bugfixing instructions and code samples is a natural use case not covered by prior art.
We present a new SearchBySnippet dataset implementing the search-by-code use case based on StackOverflow data.
arXiv Detail & Related papers (2023-05-19T12:09:30Z)
- Generation-Augmented Query Expansion For Code Retrieval [51.20943646688115]
We propose a generation-augmented query expansion framework, inspired by the human retrieval process of sketching an answer before searching; a minimal illustration of this idea follows the entry.
We achieve new state-of-the-art results on the CodeSearchNet benchmark.
arXiv Detail & Related papers (2022-12-20T23:49:37Z)
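The "sketch an answer before searching" idea fits in a few lines. In this sketch, `generate` is a hypothetical stand-in for an LLM call, and plain concatenation is a generic expansion strategy, not necessarily the paper's exact formulation.

```python
def generate(nl_query: str) -> str:
    """Hypothetical LLM call that drafts a candidate code snippet for the query."""
    return (
        "def read_json(path):\n"
        "    import json\n"
        "    with open(path) as f:\n"
        "        return json.load(f)"
    )

def expand_query(nl_query: str) -> str:
    """Append the generated sketch so the retriever can match on code tokens too."""
    return nl_query + "\n" + generate(nl_query)

# The expanded query now contains identifiers like json.load that lexical
# or dense retrievers can match against real code snippets.
print(expand_query("how to read a json file in python"))
```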
- CoSQA: 20,000+ Web Queries for Code Search and Question Answering [63.92224685262063]
The CoSQA dataset includes 20,604 labels for pairs of natural language queries and code snippets.
We introduce a contrastive learning method dubbed CoCLR to enhance query-code matching; a generic sketch of such a contrastive objective follows this entry.
We show that, evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves code question answering accuracy by 5.1%.
arXiv Detail & Related papers (2021-05-27T15:37:21Z)
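An in-batch contrastive objective of the kind query-code matching methods build on can be written down generically. This InfoNCE-style sketch is a common formulation and may differ from CoCLR's exact loss.

```python
import numpy as np

def info_nce(q: np.ndarray, c: np.ndarray, tau: float = 0.05) -> float:
    """In-batch contrastive loss for paired query/code embeddings.

    q, c: [batch, dim] unit-normalized embeddings; row i of q is the
    positive match for row i of c, and all other rows serve as negatives.
    """
    logits = (q @ c.T) / tau                        # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())         # -log p(correct pairing)
```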
- BERT2Code: Can Pretrained Language Models be Leveraged for Code Search? [0.7953229555481884]
We show that our model learns the inherent relationship between the embedding spaces, and we further probe the scope for improvement.
In this analysis, we show that the quality of the code embedding model is the bottleneck for our model's performance.
arXiv Detail & Related papers (2021-04-16T10:28:27Z)
- Search4Code: Code Search Intent Classification Using Weak Supervision [5.441318460204245]
We propose a weak-supervision-based approach for detecting code search intent in search queries for the C# and Java programming languages; a toy labeling heuristic in this spirit is sketched after the entry.
We evaluate the approach against several baselines on a real-world dataset comprising over 1 million queries mined from the Bing web search engine.
We also release Search4Code, the first large-scale real-world dataset of code search queries mined from the Bing web search engine.
arXiv Detail & Related papers (2020-11-24T08:06:53Z)
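Weak supervision for intent classification typically starts from simple labeling heuristics whose noisy votes are aggregated into training labels. The keyword lists below are invented for illustration and are not Search4Code's actual labeling sources.

```python
# Toy labeling functions: each votes 1 (code-search intent) or 0 (abstain).
CODE_HINTS = ("nullpointerexception", "stack trace", "compile error", "import")

def lf_keywords(query: str) -> int:
    q = query.lower()
    return 1 if any(h in q for h in CODE_HINTS) else 0

def lf_language_mention(query: str) -> int:
    q = query.lower()
    return 1 if ("java" in q or "c#" in q) else 0

def weak_label(query: str) -> int:
    """Aggregate labeling-function votes into a noisy training label."""
    votes = [lf_keywords(query), lf_language_mention(query)]
    return 1 if sum(votes) >= 1 else 0
```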
- Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching (DGMS) model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets as unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z)
- Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent [1.1168121941015012]
We study how code retrieval systems can be improved by leveraging descriptions to better capture the intents of code snippets.
Building on recent progress in transfer learning and natural language processing, we create a domain-specific retrieval model for code annotated with a natural language description.
arXiv Detail & Related papers (2020-08-27T15:39:09Z)