Improving Code Example Recommendations on Informal Documentation Using
BERT and Query-Aware LSH: A Comparative Study
- URL: http://arxiv.org/abs/2305.03017v4
- Date: Mon, 6 Nov 2023 17:50:37 GMT
- Title: Improving Code Example Recommendations on Informal Documentation Using
BERT and Query-Aware LSH: A Comparative Study
- Authors: Sajjad Rahmani, AmirHossein Naghshzan, Latifa Guerrouj
- Abstract summary: The focus of our study is Stack Overflow, a commonly used resource for coding discussions and solutions.
We applied BERT, a powerful Large Language Model (LLM), to transform code examples into numerical vectors.
Once these numerical representations are prepared, we identify Approximate Nearest Neighbors (ANN) using Locality-Sensitive Hashing (LSH).
Our study revealed that the Query-Aware (QA) approach showed superior performance over the Random Hyperplane-based (RH) method.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our research investigates the recommendation of code examples to aid software
developers, a practice that saves developers significant time by providing
ready-to-use code snippets. The focus of our study is Stack Overflow, a
commonly used resource for coding discussions and solutions, particularly in
the context of the Java programming language. We applied BERT, a powerful Large
Language Model (LLM), to transform code examples into numerical vectors by
extracting their semantic information. Once these numerical
representations are prepared, we identify Approximate Nearest Neighbors (ANN)
using Locality-Sensitive Hashing (LSH). Our research employed two variants of
LSH: Random Hyperplane-based LSH and Query-Aware LSH. We rigorously compared
these two approaches across four parameters: HitRate, Mean Reciprocal Rank
(MRR), Average Execution Time, and Relevance. Our study revealed that the
Query-Aware (QA) approach showed superior performance over the Random
Hyperplane-based (RH) method. Specifically, it exhibited a notable improvement
of 20% to 35% in HitRate for query pairs compared to the RH approach.
Furthermore, the QA approach proved significantly more time-efficient: it creates
hashing tables and assigns data samples to buckets at least four times faster.
It can return code examples within milliseconds,
whereas the RH approach typically requires several seconds to recommend code
examples. Due to the superior performance of the QA approach, we tested it
against PostFinder and FaCoY, the state-of-the-art baselines. Our QA method
showed comparable efficiency, proving its potential for effective code
recommendation.
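
To make the pipeline described above concrete, the sketch below shows how code snippets might be embedded with a BERT-family encoder and then indexed with a random-hyperplane LSH table and a simplified query-aware variant. It is a minimal illustration under stated assumptions, not the authors' implementation: the checkpoint name (bert-base-uncased), the mean-pooling step, the number of hyperplanes/projections, the window radius, and the collision threshold are all illustrative choices, and the query-aware class only approximates the idea of forming buckets around the query's own projection at search time.

```python
# Minimal sketch of the recommendation pipeline; NOT the authors' implementation.
# The checkpoint, pooling strategy, and all LSH parameters are assumptions.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption: the paper does not pin a checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()


def embed(snippets: list[str]) -> np.ndarray:
    """Mean-pool BERT's last hidden states to get one vector per code snippet."""
    batch = tokenizer(snippets, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # (n, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (n, seq_len, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # (n, dim)


class RandomHyperplaneLSH:
    """RH-LSH: fixed sign buckets defined by random hyperplanes at index time."""

    def __init__(self, dim: int, n_planes: int = 16, seed: int = 0):
        self.planes = np.random.default_rng(seed).normal(size=(n_planes, dim))
        self.buckets: dict[tuple, list[int]] = {}

    def _key(self, vec: np.ndarray) -> tuple:
        return tuple((self.planes @ vec > 0).astype(int))

    def index(self, vectors: np.ndarray) -> None:
        for i, v in enumerate(vectors):
            self.buckets.setdefault(self._key(v), []).append(i)

    def query(self, q: np.ndarray, vectors: np.ndarray, k: int = 10) -> list[int]:
        cand = np.array(self.buckets.get(self._key(q), []), dtype=int)
        if cand.size == 0:
            return []
        sims = vectors[cand] @ q   # rank the bucket's candidates by dot product
        return list(cand[np.argsort(-sims)][:k])


class QueryAwareLSH:
    """Simplified query-aware variant: 1-D projections are stored once, and the
    bucket is formed at query time as a window around the query's projection,
    with candidates selected by counting collisions across projections."""

    def __init__(self, dim: int, n_proj: int = 16, radius: float = 1.0,
                 min_collisions: int = 12, seed: int = 0):
        self.proj = np.random.default_rng(seed).normal(size=(n_proj, dim))
        self.radius = radius
        self.min_collisions = min_collisions
        self.projected = None

    def index(self, vectors: np.ndarray) -> None:
        self.projected = vectors @ self.proj.T              # (n, n_proj)

    def query(self, q: np.ndarray, vectors: np.ndarray, k: int = 10) -> list[int]:
        hits = np.abs(self.projected - self.proj @ q) <= self.radius
        cand = np.where(hits.sum(axis=1) >= self.min_collisions)[0]
        if cand.size == 0:
            return []
        sims = vectors[cand] @ q
        return list(cand[np.argsort(-sims)][:k])


def hit_rate_and_mrr(ranked: list[list[int]], relevant: list[int]) -> tuple[float, float]:
    """HitRate and MRR over queries, assuming one relevant item per query."""
    hits, rr = 0, 0.0
    for recs, rel in zip(ranked, relevant):
        if rel in recs:
            hits += 1
            rr += 1.0 / (recs.index(rel) + 1)
    return hits / len(ranked), rr / len(ranked)
```

In this sketch the RH table fixes its buckets before any query arrives, while the query-aware index only stores projections and builds its candidate set relative to each incoming query; that deferred bucketing is consistent with the abstract's report of faster table construction and millisecond-level retrieval for the QA variant.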
Related papers
- Large Language Models Prompting With Episodic Memory [53.8690170372303]
We propose PrOmpting with Episodic Memory (POEM), a novel prompt optimization technique that is simple, efficient, and demonstrates strong generalization capabilities.
In the testing phase, we optimize the sequence of examples for each test query by selecting the sequence that yields the highest total rewards from the top-k most similar training examples in the episodic memory.
Our results show that POEM outperforms recent techniques like TEMPERA and RLPrompt by over 5.3% in various text classification tasks.
arXiv Detail & Related papers (2024-08-14T11:19:28Z)
- LLM Agents Improve Semantic Code Search [6.047454623201181]
We introduce the approach of using Retrieval Augmented Generation powered agents to inject information into user prompts.
By utilizing RAG, agents enhance user queries with relevant details from GitHub repositories, making them more informative and contextually aligned.
Experimental results on the CodeSearchNet dataset demonstrate that RepoRift significantly outperforms existing methods.
arXiv Detail & Related papers (2024-08-05T00:43:56Z)
- DeTriever: Decoder-representation-based Retriever for Improving NL2SQL In-Context Learning [19.93800175353809]
DeTriever is a novel demonstration retrieval framework that learns a weighted combination of hidden states.
Our method significantly outperforms the state-of-the-art baselines on one-shot NL2SQL tasks.
arXiv Detail & Related papers (2024-06-12T06:33:54Z)
- LlamaRec: Two-Stage Recommendation using Large Language Models for Ranking [10.671747198171136]
We propose a two-stage framework using large language models for ranking-based recommendation (LlamaRec)
In particular, we use small-scale sequential recommenders to retrieve candidates based on the user interaction history.
LlamaRec consistently achieves superior results in both recommendation performance and efficiency across datasets.
arXiv Detail & Related papers (2023-10-25T06:23:48Z)
- An In-Context Learning Agent for Formal Theorem-Proving [10.657173216834668]
We present an in-context learning agent for formal theorem-proving in environments like Lean and Coq.
COPRA repeatedly asks a large language model to propose tactic applications from within a stateful backtracking search.
We evaluate our implementation of COPRA on the miniF2F benchmark for Lean and a set of Coq tasks from the CompCert project.
arXiv Detail & Related papers (2023-10-06T16:21:22Z)
- DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning [66.85379279041128]
In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking to automatically select exemplars for in-context learning.
DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%.
arXiv Detail & Related papers (2023-10-04T16:44:37Z)
- REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models [11.78036105494679]
This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs)
We present the first-ever code search method that encodes dynamic information during training without the need to execute either the corpus under search or the search query at inference time.
arXiv Detail & Related papers (2023-05-05T20:46:56Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval [55.097573036580066]
Experimental results show that LaPraDoR achieves state-of-the-art performance compared with supervised dense retrieval models.
Compared to re-ranking, our lexicon-enhanced approach can be run in milliseconds (22.5x faster) while achieving superior performance.
arXiv Detail & Related papers (2022-03-11T18:53:12Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
- Faster Person Re-Identification [68.22203008760269]
We introduce a new solution for fast ReID by formulating a novel Coarse-to-Fine hashing code search strategy.
It uses shorter codes to coarsely rank broad matching similarities and longer codes to refine only a few top candidates for more accurate instance ReID.
Experimental results on 2 datasets show that our proposed method (CtF) is not only 8% more accurate but also 5x faster than contemporary hashing ReID methods.
arXiv Detail & Related papers (2020-08-16T03:02:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.