DeSkew-LSH based Code-to-Code Recommendation Engine
- URL: http://arxiv.org/abs/2111.04473v1
- Date: Fri, 5 Nov 2021 16:56:28 GMT
- Title: DeSkew-LSH based Code-to-Code Recommendation Engine
- Authors: Fran Silavong, Sean Moran, Antonios Georgiadis, Rohan Saphal, Robert
Otter
- Abstract summary: We present Senatus, a new code-to-code recommendation engine for machine learning on source code.
At the core of Senatus is De-Skew LSH, a new locality-sensitive hashing algorithm that indexes the data for fast (sub-linear time) retrieval.
We show Senatus improves performance by 6.7% F1 and achieves 16x faster query time compared to Facebook Aroma on the task of code-to-code recommendation.
- Score: 3.7011129410662558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning on source code (MLOnCode) is a popular research field that
has been driven by the availability of large-scale code repositories and the
development of powerful probabilistic and deep learning models for mining
source code. Code-to-code recommendation is a task in MLOnCode that aims to
recommend relevant, diverse and concise code snippets that usefully extend the
code currently being written by a developer in their development environment
(IDE). Code-to-code recommendation engines hold the promise of increasing
developer productivity by reducing context switching from the IDE and
increasing code-reuse. Existing code-to-code recommendation engines do not
scale gracefully to large codebases, exhibiting a linear growth in query time
as the code repository increases in size. In addition, existing code-to-code
recommendation engines fail to account for the global statistics of code
repositories in the ranking function, such as the distribution of code snippet
lengths, leading to sub-optimal retrieval results. We address both of these
weaknesses with Senatus, a new code-to-code recommendation engine. At
the core of Senatus is De-Skew LSH, a new locality-sensitive hashing
(LSH) algorithm that indexes the data for fast (sub-linear time) retrieval
while also counteracting the skewness in the snippet length distribution
using novel abstract syntax tree-based feature scoring and selection
algorithms. We
evaluate Senatus via automatic evaluation and with an expert developer user
study and find the recommendations to be of higher quality than competing
baselines, while achieving faster search. For example, on the CodeSearchNet
dataset we show that Senatus improves performance by 6.7% F1 and achieves
16x faster query time compared to Facebook Aroma on the task of
code-to-code recommendation.
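The abstract pins down the overall recipe: index snippets with locality-sensitive hashing for sub-linear lookup, and select a fixed budget of scored features per snippet so that the skewed length distribution does not distort similarity. Below is a minimal MinHash-LSH sketch of that recipe; the token features, rarity-based scoring, and all parameters are illustrative assumptions, not Senatus's actual AST-based algorithms.

```python
# Minimal MinHash-LSH sketch of the idea above: cap every snippet at a
# fixed feature budget (the "de-skew" step) before hashing, so snippet
# length no longer skews set similarity, then bucket MinHash signatures
# by band for sub-linear candidate lookup. Features, scoring, and all
# parameters are illustrative assumptions, not the paper's algorithms.
import hashlib
import random
from collections import defaultdict

NUM_HASHES = 64   # MinHash signature length (assumed)
NUM_BANDS = 16    # LSH bands; 4 signature rows per band
KEEP_TOP_K = 30   # fixed per-snippet feature budget (the de-skew knob)

random.seed(0)
SALTS = [random.getrandbits(32) for _ in range(NUM_HASHES)]

def features(snippet, doc_freq):
    """Stand-in for AST features: whitespace tokens, rarest KEEP_TOP_K kept."""
    toks = sorted(set(snippet.split()), key=lambda t: doc_freq.get(t, 0))
    return set(toks[:KEEP_TOP_K]) or {"<empty>"}

def minhash(feature_set):
    """One minimum per salted hash function over the feature set."""
    return [min(int.from_bytes(hashlib.blake2b(f"{s}:{f}".encode(),
                digest_size=8).digest(), "big") for f in feature_set)
            for s in SALTS]

class DeSkewIndex:
    def __init__(self):
        self.buckets = defaultdict(set)    # (band id, band hash) -> ids
        self.doc_freq = defaultdict(int)   # token -> document frequency
        self.snippets = []

    def _bands(self, sig):
        rows = NUM_HASHES // NUM_BANDS
        return [(b, tuple(sig[b * rows:(b + 1) * rows]))
                for b in range(NUM_BANDS)]

    def add(self, snippet):
        # A real index would compute document frequencies in a first pass;
        # here they are updated incrementally for brevity.
        for tok in set(snippet.split()):
            self.doc_freq[tok] += 1
        sid = len(self.snippets)
        self.snippets.append(snippet)
        for key in self._bands(minhash(features(snippet, self.doc_freq))):
            self.buckets[key].add(sid)

    def query(self, snippet):
        # Only the buckets selected by the query's own bands are touched,
        # so lookup cost does not grow linearly with the corpus.
        hits = set()
        for key in self._bands(minhash(features(snippet, self.doc_freq))):
            hits |= self.buckets.get(key, set())
        return [self.snippets[i] for i in hits]
```

Two snippets collide when every row of at least one band matches, which is likely only at high Jaccard similarity; tuning NUM_BANDS trades recall against query speed.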
Related papers
- When to Stop? Towards Efficient Code Generation in LLMs with Excess Token Prevention [43.39584272739589]
We introduce CodeFast, an inference acceleration approach for Code LLMs on code generation.
The key idea of CodeFast is to terminate the inference process as soon as unnecessary excess tokens are detected (a toy version of this loop is sketched after the entry).
We conduct extensive experiments with CodeFast on five representative Code LLMs across four widely used code generation datasets.
arXiv Detail & Related papers (2024-07-29T14:27:08Z)
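Going only by the summary, the mechanism amounts to a generation loop with an extra stopping check. Here is a minimal sketch; `generate_next_token` and `is_excess` are hypothetical stand-ins, and CodeFast's actual detector is learned rather than rule-based.

```python
# Toy early termination during code generation, in the spirit of the
# CodeFast summary above. Both callbacks are hypothetical stand-ins.
def generate_with_early_stop(generate_next_token, is_excess, max_tokens=512):
    tokens = []
    for _ in range(max_tokens):
        tok = generate_next_token(tokens)
        if tok is None:             # model emitted end-of-sequence normally
            break
        if is_excess(tokens, tok):  # detector flags the token as excess
            break                   # stop now instead of generating more
        tokens.append(tok)
    return tokens
```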
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search [7.822427053078387]
Generation-Augmented Retrieval (GAR) framework generates exemplar code snippets to augment queries.
We propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization (a toy version of this flow is sketched after the entry).
Code Style Similarity is the first metric tailored to quantify stylistic similarities in code.
arXiv Detail & Related papers (2024-01-09T12:12:50Z)
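Reading just this summary, the retrieval flow can be sketched as two hooks around an ordinary scorer: generate an exemplar snippet from the query (GAR) and style-normalize the candidates (ReCo). Every name below is a hypothetical placeholder; the paper's rewriting model and its Code Style Similarity metric are not specified here.

```python
# Toy GAR + ReCo-style retrieval flow based on the summary above.
# generate_exemplar, rewrite_style, and score are hypothetical hooks
# (in practice, LLM calls and a style-aware similarity function).
def search(query, corpus, generate_exemplar, rewrite_style, score):
    exemplar = generate_exemplar(query)  # GAR: augment the query with code
    normalized = {i: rewrite_style(c) for i, c in corpus.items()}  # ReCo
    return max(normalized, key=lambda i: score(exemplar, normalized[i]))
```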
- A^3-CodGen: A Repository-Level Code Generation Framework for Code Reuse with Local-Aware, Global-Aware, and Third-Party-Library-Aware [13.27883339389175]
We propose a novel code generation framework, dubbed A3-CodGen, to harness information within the code repository to generate code with fewer potential logical errors.
Results demonstrate that by adopting the A3-CodGen framework, we successfully extract, fuse, and feed code repository information into the LLM, generating more accurate, efficient, and highly reusable code.
arXiv Detail & Related papers (2023-12-10T05:36:06Z)
- Tackling Long Code Search with Splitting, Encoding, and Aggregating [67.02322603435628]
We propose a new baseline SEA (Split, Encode and Aggregate) for long code search.
It splits long code into code blocks, encodes these blocks into embeddings, and aggregates them to obtain a comprehensive long code representation (a toy version of this pipeline is sketched after the entry).
With GraphCodeBERT as the encoder, SEA achieves an overall mean reciprocal ranking score of 0.785, which is 10.1% higher than GraphCodeBERT on the CodeSearchNet benchmark.
arXiv Detail & Related papers (2022-08-24T02:27:30Z)
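The SEA summary is explicit about the three stages, so a toy pipeline is easy to sketch. The sliding split, hash-based stub encoder, and mean-pooling aggregation below are illustrative assumptions; SEA itself uses GraphCodeBERT as the encoder and its own aggregation scheme.

```python
# Toy split-encode-aggregate pipeline in the spirit of the SEA summary.
from typing import Callable, List

def split_code(code: str, block_size: int = 8) -> List[str]:
    """Split long code into fixed-size blocks of lines."""
    lines = code.splitlines()
    return ["\n".join(lines[i:i + block_size])
            for i in range(0, len(lines), block_size)]

def stub_encoder(block: str, dim: int = 4) -> List[float]:
    """Placeholder for a neural encoder such as GraphCodeBERT."""
    return [float((hash(block) >> (8 * i)) & 0xFF) / 255.0
            for i in range(dim)]

def encode_long_code(code: str,
                     encoder: Callable[[str], List[float]] = stub_encoder
                     ) -> List[float]:
    embs = [encoder(b) for b in split_code(code)]
    # Aggregate block embeddings; mean pooling is one simple choice.
    return [sum(col) / len(embs) for col in zip(*embs)]
```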
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations (a generic contrastive loss is sketched after the entry).
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
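The summary names the training signal (unimodal and bimodal positive pairs) but not the loss, so the sketch below uses a generic in-batch InfoNCE formulation common in contrastive representation learning; treat it as an assumption rather than CodeRetriever's exact objective. Pair construction (documentation/function-name positives, text-code pairs) would happen upstream.

```python
# Generic in-batch InfoNCE loss over paired embeddings: queries[i] and
# keys[i] form a positive pair; every other key serves as a negative.
import math
from typing import List

def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def info_nce(queries: List[List[float]], keys: List[List[float]],
             temperature: float = 0.05) -> float:
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # stabilize log-sum-exp before exponentiating
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax prob of the positive pair
    return loss / len(queries)
```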
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
- Faster Person Re-Identification [68.22203008760269]
We introduce a new solution for fast ReID by formulating a novel Coarse-to-Fine hashing code search strategy.
It uses shorter codes to coarsely rank broad matching similarities and longer codes to refine only a few top candidates for more accurate instance ReID (a toy two-stage ranking loop is sketched after the entry).
Experimental results on 2 datasets show that our proposed method (CtF) is not only 8% more accurate but also 5x faster than contemporary hashing ReID methods.
arXiv Detail & Related papers (2020-08-16T03:02:49Z)
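The coarse-to-fine idea reduces to a two-stage ranking loop: cheap short binary codes rank the whole gallery, and longer codes re-rank only a few survivors. The code lengths and candidate cutoff below are assumed for illustration.

```python
# Toy coarse-to-fine Hamming search in the spirit of the CtF summary above.
from typing import Dict, List

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def coarse_to_fine(query_short: int, query_long: int,
                   short_codes: Dict[str, int], long_codes: Dict[str, int],
                   top_k: int = 5) -> List[str]:
    # Coarse stage: cheap ranking of every gallery item with short codes.
    order = sorted(short_codes,
                   key=lambda i: hamming(query_short, short_codes[i]))
    # Fine stage: longer, costlier codes touch only the top candidates.
    return sorted(order[:top_k],
                  key=lambda i: hamming(query_long, long_codes[i]))
```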
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.