OCoR: An Overlapping-Aware Code Retriever
- URL: http://arxiv.org/abs/2008.05201v2
- Date: Thu, 20 Aug 2020 12:05:03 GMT
- Title: OCoR: An Overlapping-Aware Code Retriever
- Authors: Qihao Zhu, Zeyu Sun, Xiran Liang, Yingfei Xiong, Lu Zhang
- Abstract summary: Given a natural language description, code retrieval aims to search for the most relevant code among a set of code snippets.
Existing state-of-the-art approaches apply neural networks to code retrieval.
We propose a novel neural architecture named OCoR, where we introduce two specifically-designed components to capture overlaps.
- Score: 15.531119719750807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code retrieval helps developers reuse code snippets from open-source
projects. Given a natural language description, code retrieval aims to find
the most relevant code among a set of code snippets. Existing state-of-the-art
approaches apply neural networks to code retrieval. However, these approaches
still fail to capture an important feature: overlaps. The overlaps between
different names used by different people indicate that two different names may
be potentially related (e.g., "message" and "msg"), and the overlaps between
identifiers in code and words in natural language descriptions indicate that
the code snippet and the description may potentially be related. To address
these problems, we propose a novel neural architecture named OCoR, where we
introduce two specifically-designed components to capture overlaps: the first
embeds identifiers by character to capture the overlaps between identifiers,
and the second introduces a novel overlap matrix to represent the degrees of
overlaps between each natural language word and each identifier.
The evaluation was conducted on two established datasets. The experimental
results show that OCoR significantly outperforms the existing state-of-the-art
approaches and achieves 13.1% to 22.3% improvements. Moreover, we also
conducted several in-depth experiments to help understand the performance of
different components in OCoR.
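The overlap matrix described above can be illustrated with a minimal sketch. The exact scoring function OCoR uses is not given in this summary; here, as an assumption for illustration, each entry scores the character overlap between a natural language word and a code identifier via normalized longest-common-subsequence length.

```python
# Hypothetical sketch of an overlap matrix: entry (i, j) scores the
# character-level overlap between NL word i and identifier j.
# The LCS-based scoring below is an assumption, not OCoR's actual formula.

def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def overlap_matrix(words, identifiers):
    """Degree of overlap between each NL word and each identifier, in [0, 1]."""
    return [[lcs_len(w.lower(), t.lower()) / max(len(w), len(t))
             for t in identifiers] for w in words]

M = overlap_matrix(["message", "send"], ["msg", "send_fn"])
# "message" vs "msg": LCS "msg" has length 3, normalized by 7 -> ~0.43,
# capturing exactly the "message"/"msg" relatedness the abstract mentions.
```

Note the character-level view: a token-level exact match would score "message" vs "msg" as zero, which is the gap the overlap components are designed to close.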
Related papers
- Deep Code Search with Naming-Agnostic Contrastive Multi-View Learning [11.337238450492546]
We propose a naming-agnostic code search method (NACS) based on contrastive multi-view code representation learning.
NACS strips information bound to variable names from the Abstract Syntax Tree (AST), the representation of the abstract syntactic structure of source code, and focuses on capturing intrinsic properties solely from AST structures.
arXiv Detail & Related papers (2024-08-18T03:47:34Z)
- When simplicity meets effectiveness: Detecting code comments coherence with word embeddings and LSTM [6.417777780911223]
Code comments play a crucial role in software development, as they provide programmers with practical information.
Developers tend to leave comments unchanged after updating the code, resulting in a discrepancy between the two artifacts.
It is crucial to identify if, given a code snippet, its corresponding comment is coherent and reflects well the intent behind the code.
arXiv Detail & Related papers (2024-05-25T15:21:27Z)
- Language Agnostic Code Embeddings [61.84835551549612]
We focus on the cross-lingual capabilities of code embeddings across different programming languages.
Code embeddings comprise two distinct components: one deeply tied to the nuances and syntax of a specific language, and the other remaining agnostic to these details.
We show that when we isolate and eliminate this language-specific component, we witness significant improvements in downstream code retrieval tasks.
arXiv Detail & Related papers (2023-10-25T17:34:52Z)
- Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z)
- Multiview Identifiers Enhanced Generative Retrieval [78.38443356800848]
Generative retrieval generates identifier strings of passages as the retrieval target.
We propose a new type of identifier, synthetic identifiers, that are generated based on the content of a passage.
Our proposed approach performs the best in generative retrieval, demonstrating its effectiveness and robustness.
arXiv Detail & Related papers (2023-05-26T06:50:21Z)
- CSSAM: Code Search via Attention Matching of Code Semantics and Structures [8.547332796736107]
This paper introduces a code search model named CSSAM (Code Semantics and Structures Attention Matching).
By introducing semantic and structural matching mechanisms, CSSAM effectively extracts and fuses multidimensional code features.
By leveraging the residual interaction, a matching module is designed to preserve more code semantics and descriptive features.
arXiv Detail & Related papers (2022-08-08T05:45:40Z)
- NS3: Neuro-Symbolic Semantic Code Search [33.583344165521645]
We use a Neural Module Network architecture to implement this idea.
We compare our model - NS3 (Neuro-Symbolic Semantic Search) - to a number of baselines, including state-of-the-art semantic code retrieval methods.
We demonstrate that our approach results in more precise code retrieval, and we study the effectiveness of our modular design when handling compositional queries.
arXiv Detail & Related papers (2022-05-21T20:55:57Z)
- Label Semantics for Few Shot Named Entity Recognition [68.01364012546402]
We study the problem of few shot learning for named entity recognition.
We leverage the semantic information in the names of the labels as a way of giving the model additional signal and enriched priors.
Our model learns to match the representations of named entities computed by the first encoder with label representations computed by the second encoder.
arXiv Detail & Related papers (2022-03-16T23:21:05Z)
- Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose DGMS, an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
- Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations [28.61567319928316]
Corder is a self-supervised contrastive learning framework for source code models.
The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets.
We have shown that the code models pretrained by Corder substantially outperform the other baselines for code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks.
arXiv Detail & Related papers (2020-09-06T13:31:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.