Improving Code Example Recommendations on Informal Documentation Using
BERT and Query-Aware LSH: A Comparative Study
- URL: http://arxiv.org/abs/2305.03017v4
- Date: Mon, 6 Nov 2023 17:50:37 GMT
- Title: Improving Code Example Recommendations on Informal Documentation Using
BERT and Query-Aware LSH: A Comparative Study
- Authors: Sajjad Rahmani, AmirHossein Naghshzan, Latifa Guerrouj
- Abstract summary: The focus of our study is Stack Overflow, a commonly used resource for coding discussions and solutions.
We applied BERT, a powerful Large Language Model (LLM), to transform code examples into numerical vectors.
Once these numerical representations are prepared, we identify Approximate Nearest Neighbors (ANN) using Locality-Sensitive Hashing (LSH).
Our study revealed that the Query-Aware (QA) approach showed superior performance over the Random Hyperplane-based (RH) method.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our research investigates the recommendation of code examples to aid software
developers, a practice that saves developers significant time by providing
ready-to-use code snippets. The focus of our study is Stack Overflow, a
commonly used resource for coding discussions and solutions, particularly in
the context of the Java programming language. We applied BERT, a powerful Large
Language Model (LLM), to transform code examples into numerical vectors by
extracting their semantic information. Once these numerical
representations are prepared, we identify Approximate Nearest Neighbors (ANN)
using Locality-Sensitive Hashing (LSH). Our research employed two variants of
LSH: Random Hyperplane-based LSH and Query-Aware LSH. We rigorously compared
these two approaches across four parameters: HitRate, Mean Reciprocal Rank
(MRR), Average Execution Time, and Relevance. Our study revealed that the
Query-Aware (QA) approach showed superior performance over the Random
Hyperplane-based (RH) method. Specifically, it exhibited a notable improvement
of 20% to 35% in HitRate for query pairs compared to the RH approach.
Furthermore, the QA approach proved significantly more time-efficient: it creates
hashing tables and assigns data samples to buckets at least four times faster.
It can return code examples within milliseconds,
whereas the RH approach typically requires several seconds to recommend code
examples. Due to the superior performance of the QA approach, we tested it
against PostFinder and FaCoY, the state-of-the-art baselines. Our QA method
showed comparable efficiency, proving its potential for effective code
recommendation.
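
To make the pipeline described above concrete, the sketch below shows how code snippets might be embedded with a BERT-family encoder and then indexed with a random-hyperplane LSH table and a simplified query-aware variant. It is a minimal illustration under stated assumptions, not the authors' implementation: the checkpoint name (bert-base-uncased), the mean-pooling step, the number of hyperplanes/projections, the window radius, and the collision threshold are all illustrative choices, and the query-aware class only approximates the idea of forming buckets around the query's own projection at search time.

```python
# Minimal sketch of the recommendation pipeline; NOT the authors' implementation.
# The checkpoint, pooling strategy, and all LSH parameters are assumptions.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption: the paper does not pin a checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()


def embed(snippets: list[str]) -> np.ndarray:
    """Mean-pool BERT's last hidden states to get one vector per code snippet."""
    batch = tokenizer(snippets, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # (n, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (n, seq_len, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # (n, dim)


class RandomHyperplaneLSH:
    """RH-LSH: fixed sign buckets defined by random hyperplanes at index time."""

    def __init__(self, dim: int, n_planes: int = 16, seed: int = 0):
        self.planes = np.random.default_rng(seed).normal(size=(n_planes, dim))
        self.buckets: dict[tuple, list[int]] = {}

    def _key(self, vec: np.ndarray) -> tuple:
        return tuple((self.planes @ vec > 0).astype(int))

    def index(self, vectors: np.ndarray) -> None:
        for i, v in enumerate(vectors):
            self.buckets.setdefault(self._key(v), []).append(i)

    def query(self, q: np.ndarray, vectors: np.ndarray, k: int = 10) -> list[int]:
        cand = np.array(self.buckets.get(self._key(q), []), dtype=int)
        if cand.size == 0:
            return []
        sims = vectors[cand] @ q   # rank the bucket's candidates by dot product
        return list(cand[np.argsort(-sims)][:k])


class QueryAwareLSH:
    """Simplified query-aware variant: 1-D projections are stored once, and the
    bucket is formed at query time as a window around the query's projection,
    with candidates selected by counting collisions across projections."""

    def __init__(self, dim: int, n_proj: int = 16, radius: float = 1.0,
                 min_collisions: int = 12, seed: int = 0):
        self.proj = np.random.default_rng(seed).normal(size=(n_proj, dim))
        self.radius = radius
        self.min_collisions = min_collisions
        self.projected = None

    def index(self, vectors: np.ndarray) -> None:
        self.projected = vectors @ self.proj.T              # (n, n_proj)

    def query(self, q: np.ndarray, vectors: np.ndarray, k: int = 10) -> list[int]:
        hits = np.abs(self.projected - self.proj @ q) <= self.radius
        cand = np.where(hits.sum(axis=1) >= self.min_collisions)[0]
        if cand.size == 0:
            return []
        sims = vectors[cand] @ q
        return list(cand[np.argsort(-sims)][:k])


def hit_rate_and_mrr(ranked: list[list[int]], relevant: list[int]) -> tuple[float, float]:
    """HitRate and MRR over queries, assuming one relevant item per query."""
    hits, rr = 0, 0.0
    for recs, rel in zip(ranked, relevant):
        if rel in recs:
            hits += 1
            rr += 1.0 / (recs.index(rel) + 1)
    return hits / len(ranked), rr / len(ranked)
```

In this sketch the RH table fixes its buckets before any query arrives, while the query-aware index only stores projections and builds its candidate set relative to each incoming query; that deferred bucketing is consistent with the abstract's report of faster table construction and millisecond-level retrieval for the QA variant.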
Related papers
- Large Language Models Prompting With Episodic Memory [53.8690170372303]
We propose PrOmpting with Episodic Memory (POEM), a novel prompt optimization technique that is simple, efficient, and demonstrates strong generalization capabilities.
In the testing phase, we optimize the sequence of examples for each test query by selecting the sequence that yields the highest total rewards from the top-k most similar training examples in the episodic memory.
Our results show that POEM outperforms recent techniques like TEMPERA and RLPrompt by over 5.3% in various text classification tasks.
arXiv Detail & Related papers (2024-08-14T11:19:28Z)
- LLM Agents Improve Semantic Code Search [6.047454623201181]
We introduce the approach of using Retrieval Augmented Generation powered agents to inject information into user prompts.
By utilizing RAG, agents enhance user queries with relevant details from GitHub repositories, making them more informative and contextually aligned.
Experimental results on the CodeSearchNet dataset demonstrate that RepoRift significantly outperforms existing methods.
arXiv Detail & Related papers (2024-08-05T00:43:56Z)
- DeTriever: Decoder-representation-based Retriever for Improving NL2SQL In-Context Learning [19.93800175353809]
DeTriever is a novel demonstration retrieval framework that learns a weighted combination of hidden states.
Our method significantly outperforms the state-of-the-art baselines on one-shot NL2SQL tasks.
arXiv Detail & Related papers (2024-06-12T06:33:54Z)
- LlamaRec: Two-Stage Recommendation using Large Language Models for Ranking [10.671747198171136]
We propose a two-stage framework using large language models for ranking-based recommendation (LlamaRec)
In particular, we use small-scale sequential recommenders to retrieve candidates based on the user interaction history.
LlamaRec consistently achieves superior results in both recommendation performance and efficiency across datasets.
arXiv Detail & Related papers (2023-10-25T06:23:48Z)
- An In-Context Learning Agent for Formal Theorem-Proving [10.657173216834668]
We present an in-context learning agent for formal theorem-proving in environments like Lean and Coq.
COPRA repeatedly asks a large language model to propose tactic applications from within a stateful backtracking search.
We evaluate our implementation of COPRA on the miniF2F benchmark for Lean and a set of Coq tasks from the CompCert project.
arXiv Detail & Related papers (2023-10-06T16:21:22Z)
- DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning [66.85379279041128]
In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking to automatically select exemplars for in-context learning.
DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%.
arXiv Detail & Related papers (2023-10-04T16:44:37Z)
- REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models [11.78036105494679]
This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs)
We present the first-ever code search method that encodes dynamic information during training without the need to execute either the corpus under search or the search query at inference time.
arXiv Detail & Related papers (2023-05-05T20:46:56Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval [55.097573036580066]
Experimental results show that LaPraDoR achieves state-of-the-art performance compared with supervised dense retrieval models.
Compared to re-ranking, our lexicon-enhanced approach can be run in milliseconds (22.5x faster) while achieving superior performance.
arXiv Detail & Related papers (2022-03-11T18:53:12Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
- Faster Person Re-Identification [68.22203008760269]
We introduce a new solution for fast ReID by formulating a novel Coarse-to-Fine hashing code search strategy.
It uses shorter codes to coarsely rank broad matching similarities and longer codes to refine only a few top candidates for more accurate instance ReID.
Experimental results on 2 datasets show that our proposed method (CtF) is not only 8% more accurate but also 5x faster than contemporary hashing ReID methods.
arXiv Detail & Related papers (2020-08-16T03:02:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.