Industrial-Scale Neural Network Clone Detection with Disk-Based Similarity Search
- URL: http://arxiv.org/abs/2504.17972v1
- Date: Thu, 24 Apr 2025 22:50:23 GMT
- Title: Industrial-Scale Neural Network Clone Detection with Disk-Based Similarity Search
- Authors: Gul Aftab Ahmed, Muslim Chochlov, Abdul Razzaq, James Vincent Patten, Yuanhua Han, Guoxian Lu, Jim Buckley, David Gregg
- Abstract summary: Code clones are similar code fragments that often arise from copy-and-paste programming. We extend existing neural network-based clone detection schemes to handle codebases that far exceed available memory. We demonstrate that our approach is around 2$\times$ slower than the in-memory approach for a problem size that can fit within memory.
- Score: 0.24091079613649843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code clones are similar code fragments that often arise from copy-and-paste programming. Neural networks can classify pairs of code fragments as clone/not-clone with high accuracy. However, finding clones in industrial-scale code needs a more scalable approach than pairwise comparison. We extend existing neural network-based clone detection schemes to handle codebases that far exceed available memory, using indexing and search methods for external storage such as disks and solid-state drives. We generate a high-dimensional vector embedding for each code fragment using a transformer-based neural network. We then find similar embeddings using efficient multidimensional nearest neighbor search algorithms on external storage, without pairwise comparison. We identify specific problems with industrial-scale code bases, such as large sets of almost identical code fragments that interact poorly with $k$-nearest neighbour search algorithms, and provide an effective solution. We demonstrate that our disk-based clone search approach achieves clone detection accuracy similar to that of an equivalent in-memory technique. Using a solid-state drive as external storage, our approach is around 2$\times$ slower than the in-memory approach for a problem size that can fit within memory. We further demonstrate that our approach can scale to over a billion lines of code, providing valuable insights into the trade-offs between indexing speed, query performance, and storage efficiency for industrial-scale code clone detection.
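The abstract notes that large sets of almost identical fragments interact poorly with $k$-nearest neighbour search: the copies can fill all $k$ result slots and hide other clones. The following is a minimal in-memory sketch of one way to mitigate this, assuming that near-identical embeddings are collapsed into a single representative before the search. It is not the authors' method and does not model external storage. The function names, the rounding-based grouping, the toy 3-dimensional vectors, and the 0.9 threshold are all hypothetical choices made for illustration.

```python
import heapq
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def collapse_near_identical(embeddings, decimals=3):
    """Group fragments whose rounded embeddings coincide (hypothetical heuristic).

    Returns (reps, groups): one representative vector per group and the
    list of fragment ids that each representative stands for.
    """
    buckets = defaultdict(list)
    for frag_id, vec in embeddings.items():
        key = tuple(round(x, decimals) for x in vec)
        buckets[key].append(frag_id)
    reps, groups = [], []
    for ids in buckets.values():
        reps.append(embeddings[ids[0]])  # first member represents the group
        groups.append(ids)
    return reps, groups


def knn_clone_groups(query, reps, groups, k=2, threshold=0.9):
    """Return up to k most similar groups whose similarity meets the threshold."""
    scored = ((cosine(query, rep), ids) for rep, ids in zip(reps, groups))
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [(score, ids) for score, ids in top if score >= threshold]


# Toy embeddings: three identical fragments, one close variant, one unrelated.
embeddings = {
    "a.c:10": [1.0, 0.0, 0.0],
    "b.c:22": [1.0, 0.0, 0.0],
    "c.c:5": [1.0, 0.0, 0.0],
    "d.c:40": [0.9, 0.1, 0.0],
    "e.c:7": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

reps, groups = collapse_near_identical(embeddings)
result = knn_clone_groups(query, reps, groups, k=2)
for score, ids in result:
    print(f"{score:.4f}", ids)
```

With `k=2` and no collapsing, both result slots would go to copies of the identical fragment and `d.c:40` would be missed. After collapsing, the three copies occupy one slot as a group and the variant takes the second.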
Related papers
- Efficient Beam Search for Large Language Models Using Trie-Based Decoding [10.302821791274129]
We introduce a novel parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single cache among all beams that share the same prefix, the proposed method not only reduces memory consumption dramatically but also enables parallel decoding across all branches. This innovative use of a prefix tree offers an efficient alternative for beam search, achieving significant memory savings while preserving inference speed, making it particularly well-suited for memory-constrained environments or large-scale model deployments.
arXiv Detail & Related papers (2025-01-31T16:22:36Z)
- SECRET: Towards Scalable and Efficient Code Retrieval via Segmented Deep Hashing [83.35231185111464]
Deep learning has shifted the retrieval paradigm from lexical-based matching to encode source code and queries into vector representations. Previous research proposes deep hashing-based methods, which generate hash codes for queries and code snippets and use Hamming distance for rapid recall of code candidates. We propose a novel approach, which converts long hash codes calculated by existing deep hashing approaches into several short hash code segments through an iterative training strategy.
arXiv Detail & Related papers (2024-12-16T12:51:35Z)
- CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection [20.729032739935132]
CC2Vec is a novel code encoding method designed to swiftly identify simple code clones.
We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam).
arXiv Detail & Related papers (2024-05-01T10:18:31Z)
- Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection [0.0]
SSCD is a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale.
It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search.
This paper details the approach and an empirical assessment towards configuring and evaluating that approach in an industrial setting.
arXiv Detail & Related papers (2023-09-05T12:38:55Z)
- Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization [60.91600465922932]
We present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder.
Our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods.
arXiv Detail & Related papers (2022-10-23T00:32:04Z)
- Revisiting Code Search in a Two-Stage Paradigm [67.02322603435628]
TOSS is a two-stage fusion code search framework.
It first uses IR-based and bi-encoder models to efficiently recall a small number of top-k code candidates.
It then uses fine-grained cross-encoders for finer ranking.
arXiv Detail & Related papers (2022-08-24T02:34:27Z)
- Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25$\times$.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that the proposed models perform differently on each task; however, the graph-based models generally outperform the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- Nearest neighbor search with compact codes: A decoder perspective [77.60612610421101]
We re-interpret popular methods such as binary hashing or product quantizers as auto-encoders.
We design backward-compatible decoders that improve the reconstruction of the vectors from the same codes.
arXiv Detail & Related papers (2021-12-17T15:22:28Z)
- Semantic Clone Detection via Probabilistic Software Modeling [69.43451204725324]
This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity.
We present SCD-PSM as a stable and precise solution to semantic clone detection.
arXiv Detail & Related papers (2020-08-11T17:54:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.