A Survey of Source Code Search: A 3-Dimensional Perspective
- URL: http://arxiv.org/abs/2311.07107v1
- Date: Mon, 13 Nov 2023 06:42:08 GMT
- Title: A Survey of Source Code Search: A 3-Dimensional Perspective
- Authors: Weisong Sun, Chunrong Fang, Yifei Ge, Yuling Hu, Yuchen Chen, Quanjun
Zhang, Xiuting Ge, Yang Liu, Zhenyu Chen
- Abstract summary: Code search has attracted wide attention from software engineering researchers because it can improve the productivity and quality of software development.
Many techniques have been proposed to realize effective and efficient code search.
- Score: 17.524674603550043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: (Source) code search has attracted wide attention from software engineering
researchers because it can improve the productivity and quality of software development.
Given a functionality requirement usually described in a natural language
sentence, a code search system can retrieve code snippets that satisfy the
requirement from a large-scale code corpus, e.g., GitHub. To realize effective
and efficient code search, many techniques have been proposed over the years.
These techniques improve code search performance mainly by optimizing three
core components: the query understanding component, the code understanding
component, and the query-code matching component. In this paper, we provide a
3-dimensional perspective survey for code search. Specifically, we categorize
existing code search studies into query-end optimization techniques, code-end
optimization techniques, and match-end optimization techniques according to the
specific components they optimize. Considering that each end can be optimized
independently and contributes to the code search performance, we treat each end
as a dimension. Therefore, this survey is 3-dimensional in nature, and it
provides a comprehensive summary of each dimension in detail. To understand the
research trends of the three dimensions in existing code search studies, we
systematically review 68 relevant studies. Unlike existing code
search surveys that focus only on the query end or the code end, or that cover
various aspects only shallowly (including codebase, evaluation metrics, modeling
techniques, etc.), our survey provides a more nuanced analysis and review of the
evolution and development of the underlying techniques used in the three ends.
Based on a systematic review and summary of existing work, we outline several
open challenges and opportunities at the three ends that remain to be addressed
in future work.
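The three ends described above can be sketched as a toy retrieval pipeline. The tokenization scheme, function names, and corpus below are illustrative assumptions for exposition, not a reference implementation from the survey; real systems replace the bag-of-words vectors with learned embeddings.

```python
import math
import re
from collections import Counter

def understand_query(query: str) -> Counter:
    """Query end: normalize the natural-language query into a bag of tokens."""
    return Counter(re.findall(r"[a-z]+", query.lower()))

def understand_code(snippet: str) -> Counter:
    """Code end: split identifiers (snake_case/camelCase) into a bag of tokens."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", snippet):
        tokens.extend(t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+", word))
    return Counter(tokens)

def match(query_vec: Counter, code_vec: Counter) -> float:
    """Match end: cosine similarity between the two token bags."""
    dot = sum(query_vec[t] * code_vec[t] for t in set(query_vec) & set(code_vec))
    norm = math.sqrt(sum(v * v for v in query_vec.values())) * \
           math.sqrt(sum(v * v for v in code_vec.values()))
    return dot / norm if norm else 0.0

def search(query: str, corpus: list[str]) -> str:
    """Return the corpus snippet that best matches the query."""
    q = understand_query(query)
    return max(corpus, key=lambda s: match(q, understand_code(s)))

corpus = [
    "def read_file(path): return open(path).read()",
    "def sort_list(items): return sorted(items)",
]
print(search("sort a list of items", corpus))
# → "def sort_list(items): return sorted(items)"
```

Because each end is independent, each can be optimized separately: query-end work improves `understand_query` (e.g., query expansion), code-end work improves `understand_code` (e.g., structure-aware encoders), and match-end work replaces the similarity function.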
Related papers
- SmartSearch: Process Reward-Guided Query Refinement for Search Agents [63.46067892354375]
Large language model (LLM)-based search agents have proven promising for addressing knowledge-intensive problems.
Existing works largely focus on optimizing the reasoning paradigms of search agents, yet the quality of intermediate search queries during reasoning remains overlooked.
We introduce SmartSearch, a framework built upon two key mechanisms to mitigate this issue.
arXiv Detail & Related papers (2026-01-08T12:39:05Z)
- From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [150.3696990310269]
Large language models (LLMs) have transformed automated software development by enabling direct translation of natural language descriptions into functional code.
We provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs.
We analyze the code capabilities of general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder).
arXiv Detail & Related papers (2025-11-23T17:09:34Z)
- A Survey on Code Generation with LLM-based Agents [61.474191493322415]
Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm.
LLMs are characterized by three core features.
This paper presents a systematic survey of the field of LLM-based code generation agents.
arXiv Detail & Related papers (2025-07-31T18:17:36Z)
- An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding [50.17907898478795]
This work proposes a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in real-world reverse engineering scenarios.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2025-04-30T17:02:06Z)
- CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z)
- RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [65.5353313491402]
We introduce RethinkMCTS, which employs the Monte Carlo Tree Search (MCTS) algorithm to conduct thought-level searches before generating code.
We construct verbal feedback from fine-grained code execution feedback to refine erroneous thoughts during the search.
We demonstrate that RethinkMCTS outperforms previous search-based and feedback-based code generation baselines.
arXiv Detail & Related papers (2024-09-15T02:07:28Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Prompt-based Code Completion via Multi-Retrieval Augmented Generation [15.233727939816388]
ProCC is a code completion framework leveraging prompt engineering and the contextual multi-armed bandits algorithm.
ProCC outperforms the state-of-the-art code completion technique by 8.6% on our collected open-source benchmark suite.
ProCC also allows augmenting fine-tuned techniques in a plug-and-play manner, yielding 5.6% improvement over our studied fine-tuned model.
arXiv Detail & Related papers (2024-05-13T07:56:15Z)
- Survey of Code Search Based on Deep Learning [11.94599964179766]
This survey focuses on code search, that is, retrieving code that matches a given query.
Deep learning, with its ability to extract complex semantic information, has achieved great success in this field.
We propose a new taxonomy to illustrate the state-of-the-art deep learning-based code search.
arXiv Detail & Related papers (2023-05-10T08:07:04Z)
- Deep Learning Based Code Generation Methods: Literature Review [30.17038624027751]
This paper focuses on Code Generation task that aims at generating relevant code fragments according to given natural language descriptions.
In this paper, we systematically review the current work on deep learning-based code generation methods.
arXiv Detail & Related papers (2023-03-02T08:25:42Z)
- Exploring Representation-Level Augmentation for Code Search [50.94201167562845]
We explore augmentation methods that augment data (both code and query) at the representation level, which does not require additional data processing and training.
We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset.
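As a hedged sketch of the idea, mixup-style linear interpolation in embedding space is one such representation-level operator: new training examples are created directly from encoded vectors rather than from raw code or query text. The function name and parameters below are illustrative assumptions; the paper's exact set of operators may differ.

```python
import numpy as np

def interpolate(reps: np.ndarray, alpha: float = 0.9, seed: int = 0) -> np.ndarray:
    """Mix each representation with a randomly chosen peer in the batch.

    reps: (batch, dim) matrix of code or query embeddings.
    alpha: weight kept on the original representation.
    """
    rng = np.random.default_rng(seed)
    peers = reps[rng.permutation(len(reps))]  # shuffled copy of the batch
    return alpha * reps + (1.0 - alpha) * peers

batch = np.random.default_rng(1).normal(size=(4, 8))  # 4 embeddings, dim 8
augmented = interpolate(batch)
print(augmented.shape)  # (4, 8) — same shape, new synthetic examples
```

Because the augmentation operates on already-encoded vectors, it adds negligible cost to training and needs no changes to the tokenizer or dataset pipeline.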
arXiv Detail & Related papers (2022-10-21T22:47:37Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Learning Program Semantics with Code Representations: An Empirical Study [22.953964699210296]
Program semantics learning is fundamental to various code intelligence tasks.
We categorize current mainstream code representation techniques into four categories.
We evaluate their performance on three diverse and popular code intelligence tasks.
arXiv Detail & Related papers (2022-03-22T14:51:44Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
- CoNCRA: A Convolutional Neural Network Code Retrieval Approach [0.0]
We propose a technique for semantic code search: A Convolutional Neural Network approach to code retrieval.
Our technique aims to find the code snippet that most closely matches the developer's intent, expressed in natural language.
We evaluated our approach's efficacy on a dataset composed of questions and code snippets collected from Stack Overflow.
arXiv Detail & Related papers (2020-09-03T23:38:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.