DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient
Near-Duplicate Crash Report Detection
- URL: http://arxiv.org/abs/2310.06703v1
- Date: Tue, 10 Oct 2023 15:26:27 GMT
- Authors: Youcef Remil and Anes Bendimerad and Romain Mathonat and Chedy Raissi
and Mehdi Kaytoue
- Abstract summary: With real-time streaming bug collection, systems are needed to quickly answer the question "What are the most similar bugs to a new one?", that is, to efficiently find near-duplicates.
LSH has not been considered in the crash bucketing literature.
We introduce DeepLSH, a Siamese architecture with an original loss function that perfectly approximates the locality-sensitivity property.
- Score: 0.29998889086656577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic crash bucketing is a crucial phase in the software development
process for efficiently triaging bug reports. It generally consists of grouping
similar reports through clustering techniques. However, with real-time
streaming bug collection, systems are needed to quickly answer the question
"What are the most similar bugs to a new one?", that is, to efficiently find
near-duplicates. It is thus natural to consider nearest-neighbor search to
tackle this problem and especially the well-known locality-sensitive hashing
(LSH) to deal with large datasets due to its sublinear performance and
theoretical guarantees on the similarity search accuracy. Surprisingly, LSH has
not been considered in the crash bucketing literature. It is indeed not trivial
to derive hash functions that satisfy the so-called locality-sensitive property
for the most advanced crash bucketing metrics. Consequently, we study in this
paper how to leverage LSH for this task. To be able to consider the most
relevant metrics used in the literature, we introduce DeepLSH, a Siamese DNN
architecture with an original loss function that perfectly approximates the
locality-sensitivity property even for Jaccard and Cosine metrics for which
exact LSH solutions exist. We support this claim with a series of experiments
on an original dataset, which we make available.
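For intuition, the exact LSH solution the abstract mentions for the Jaccard metric is MinHash: the probability that two signatures agree in any given slot equals the Jaccard similarity of the underlying sets, which is precisely the locality-sensitive property DeepLSH learns to approximate for other metrics. A minimal sketch (the token sets and hash construction below are illustrative, not from the paper):

```python
import random

def minhash_signature(tokens, num_hashes=64, seed=0):
    """MinHash signature: for each of num_hashes hash functions, keep the
    minimum hashed value over the set's tokens. For two sets A and B,
    P[sig_i(A) == sig_i(B)] equals their Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, t)) for t in tokens) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Two hypothetical crash-report token sets sharing 3 of 5 distinct tokens.
a = {"null", "pointer", "deref", "crash"}
b = {"null", "pointer", "deref", "abort"}
sig_a = minhash_signature(a)
sig_b = minhash_signature(b)
print(estimated_jaccard(sig_a, sig_b))  # close to the true Jaccard 3/5 = 0.6
```

Bucketing reports by bands of signature slots then yields sublinear candidate lookup, which is what makes LSH attractive for streaming near-duplicate detection.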
Related papers
- Fast Locality Sensitive Hashing with Theoretical Guarantee [5.635783105833339]
Locality-sensitive hashing (LSH) is an effective randomized technique widely used in many machine learning tasks.
In this paper, we design a simple yet efficient LSH scheme, named FastLSH, under l2 norm.
By combining random sampling and random projection, FastLSH reduces the per-hash time complexity from O(n) to O(m) (m < n), where n is the data dimensionality and m is the number of sampled dimensions.
Experimental results demonstrate that FastLSH is on par with the state of the art in terms of answer quality, space occupation and query efficiency.
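The sampling-then-projection idea can be sketched as follows (parameter values and names are illustrative, not from the FastLSH paper): restrict each vector to m randomly chosen coordinates, then apply a standard E2LSH-style projection hash, so each hash evaluation costs O(m) rather than O(n).

```python
import numpy as np

def sampled_l2_hash(x, sample_idx, a, b, w):
    """One FastLSH-style hash under the l2 norm (sketch): restrict the vector
    to m sampled dimensions, then apply the classic E2LSH projection
    h(x) = floor((a . x_S + b) / w)."""
    x_s = x[sample_idx]                      # random sampling: n dims -> m dims
    return int(np.floor((a @ x_s + b) / w))  # random projection in O(m)

rng = np.random.default_rng(0)
n, m, w = 1000, 30, 4.0
sample_idx = rng.choice(n, size=m, replace=False)  # drawn once, shared by all points
a = rng.standard_normal(m)                          # Gaussian projection vector
b = rng.uniform(0, w)                               # random offset

x = rng.standard_normal(n)
near = x + 0.01 * rng.standard_normal(n)            # a close neighbor of x
print(sampled_l2_hash(x, sample_idx, a, b, w),
      sampled_l2_hash(near, sample_idx, a, b, w))   # near points usually share a bucket
```

Because the sample indices are fixed once and reused for every point, near neighbors still land in the same bucket with high probability while each hash touches only m coordinates.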
arXiv Detail & Related papers (2023-09-27T08:21:38Z) - Unsupervised Hashing with Similarity Distribution Calibration [127.34239817201549]
Unsupervised hashing methods aim to preserve the similarity between data points in a feature space by mapping them to binary hash codes.
These methods often overlook the fact that the similarity between data points in the continuous feature space may not be preserved in the discrete hash code space.
The similarity range is bounded by the code length and can lead to a problem known as similarity collapse.
This paper introduces a novel Similarity Distribution Calibration (SDC) method to alleviate this problem.
arXiv Detail & Related papers (2023-02-15T14:06:39Z) - Global Learnable Attention for Single Image Super-Resolution [68.2129989450593]
We propose a Global Learnable Attention (GLA) to adaptively modify similarity scores of non-local textures during training.
The GLA can exploit non-local textures that have low similarity scores but more accurate details to repair severely damaged textures.
Based on the GLA, we constructed a Deep Learnable Similarity Network (DLSN) which achieves state-of-the-art performance for SISR tasks.
arXiv Detail & Related papers (2022-12-02T09:47:21Z) - Experimental Analysis of Machine Learning Techniques for Finding Search
Radius in Locality Sensitive Hashing [0.9137554315375919]
Locality Sensitive Hashing (LSH) is one of the most popular approximate nearest neighbor search techniques for high-dimensional spaces.
An improved LSH-based index structure, called radius-optimized Locality Sensitive Hashing (roLSH), has been proposed that utilizes Machine Learning to find the search radius.
arXiv Detail & Related papers (2022-11-16T18:19:10Z) - SLOSH: Set LOcality Sensitive Hashing via Sliced-Wasserstein Embeddings [18.916058638077274]
This paper focuses on non-parametric and data-independent learning from set-structured data using approximate nearest neighbor (ANN) solutions.
We propose Sliced-Wasserstein set embedding as a computationally efficient "set-2-vector" mechanism that enables downstream ANN.
We demonstrate the effectiveness of our algorithm, denoted as Set-LOcality Sensitive Hashing (SLOSH), on various set retrieval datasets.
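A minimal sketch of the "set-2-vector" idea, assuming the common sliced-Wasserstein construction: project each set's points onto random directions and record the empirical quantiles of every 1-D projection, so Euclidean distance between embeddings approximates the sliced-Wasserstein distance between sets. The number of directions and quantiles below is illustrative, not taken from SLOSH.

```python
import numpy as np

def sw_set_embed(points, thetas, num_quantiles=8):
    """Embed a set of d-dimensional points as a fixed-length vector by
    concatenating quantiles of its projections onto random unit directions."""
    qs = np.linspace(0, 1, num_quantiles)
    feats = []
    for theta in thetas:
        proj = points @ theta                 # 1-D projection of the whole set
        feats.append(np.quantile(proj, qs))   # its empirical quantile function
    return np.concatenate(feats)

rng = np.random.default_rng(0)
thetas = [v / np.linalg.norm(v) for v in rng.standard_normal((16, 2))]
set_a = rng.standard_normal((40, 2))
set_b = set_a + 0.05                      # a slightly shifted copy of set_a
set_c = rng.standard_normal((40, 2)) * 3  # a very different set
ea, eb, ec = (sw_set_embed(s, thetas) for s in (set_a, set_b, set_c))
print(np.linalg.norm(ea - eb) < np.linalg.norm(ea - ec))  # True
```

Once every set is a fixed-length vector, any off-the-shelf LSH or ANN index over vectors can serve set retrieval, which is the point of the paper.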
arXiv Detail & Related papers (2021-12-11T00:10:05Z) - A Fast Randomized Algorithm for Massive Text Normalization [26.602776972067936]
We present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data.
Our algorithm relies on the Jaccard similarity between words to suggest correction results.
Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN.
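The Jaccard-based suggestion step can be sketched as follows, assuming character n-grams as the comparison sets (the vocabulary and helper names are illustrative, not from FLAN):

```python
def char_ngrams(word, n=2):
    """Character bigrams of a word, used as the set for Jaccard similarity."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """|A intersect B| / |A union B|, the similarity the method relies on."""
    return len(a & b) / len(a | b) if a | b else 1.0

def suggest(word, vocab):
    """Suggest the vocabulary word whose n-gram set is most similar."""
    return max(vocab, key=lambda v: jaccard(char_ngrams(word), char_ngrams(v)))

vocab = ["normalize", "canonical", "random", "massive"]
print(suggest("normalise", vocab))  # -> "normalize"
```

At scale, the pairwise comparison would be replaced by MinHash-style sketches of the n-gram sets, which is what makes the randomized algorithm fast on massive corpora.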
arXiv Detail & Related papers (2021-10-06T19:18:17Z) - Learning to Hash Robustly, with Guarantees [79.68057056103014]
In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching that of theoretical algorithms.
We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically.
Our algorithm achieves 1.8x and 2.1x better recall on the worst-performing queries for the MNIST and ImageNet datasets, respectively.
arXiv Detail & Related papers (2021-08-11T20:21:30Z) - IRLI: Iterative Re-partitioning for Learning to Index [104.72641345738425]
Methods have to trade off high accuracy against load balance and scalability in distributed settings.
We propose a novel approach called IRLI, which iteratively partitions the items by learning the relevant buckets directly from the query-item relevance data.
We mathematically show that IRLI retrieves the correct item with high probability under very natural assumptions and provides superior load balancing.
arXiv Detail & Related papers (2021-03-17T23:13:25Z) - CIMON: Towards High-quality Hash Codes [63.37321228830102]
We propose a new method named Comprehensive sImilarity Mining and cOnsistency learNing (CIMON).
First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturb-invariant and discriminative hash codes.
arXiv Detail & Related papers (2020-10-15T14:47:14Z) - Reinforcing Short-Length Hashing [61.75883795807109]
Existing methods have poor performance in retrieval using an extremely short-length hash code.
In this study, we propose a novel reinforcing short-length hashing (RSLH) method.
In RSLH, mutual reconstruction between the hash representation and semantic labels is performed to preserve the semantic information.
Experiments on three large-scale image benchmarks demonstrate the superior performance of RSLH under various short-length hashing scenarios.
arXiv Detail & Related papers (2020-04-24T02:23:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.