Differentially Private One Permutation Hashing and Bin-wise Consistent
Weighted Sampling
- URL: http://arxiv.org/abs/2306.07674v1
- Date: Tue, 13 Jun 2023 10:38:12 GMT
- Title: Differentially Private One Permutation Hashing and Bin-wise Consistent
Weighted Sampling
- Authors: Xiaoyun Li and Ping Li
- Abstract summary: Minwise hashing (MinHash) is a standard algorithm widely used in the industry for large-scale search and learning applications.
One permutation hashing (OPH) is an efficient alternative of MinHash which splits the data vectors into $K$ bins and generates hash values within each bin.
We propose DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH.
- Score: 37.6593006747285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Minwise hashing (MinHash) is a standard algorithm widely used in the
industry, for large-scale search and learning applications with the binary
(0/1) Jaccard similarity. One common use of MinHash is for processing massive
n-gram text representations so that practitioners do not have to materialize
the original data (which would be prohibitive). Another popular use of MinHash
is for building hash tables to enable sub-linear time approximate near neighbor
(ANN) search. MinHash has also been used as a tool for building large-scale
machine learning systems. The standard implementation of MinHash requires
applying $K$ random permutations. In comparison, the method of one permutation
hashing (OPH), is an efficient alternative of MinHash which splits the data
vectors into $K$ bins and generates hash values within each bin. OPH is
substantially more efficient and also more convenient to use.
In this paper, we combine the differential privacy (DP) with OPH (as well as
MinHash), to propose the DP-OPH framework with three variants: DP-OPH-fix,
DP-OPH-re and DP-OPH-rand, depending on which densification strategy is adopted
to deal with empty bins in OPH. A detailed roadmap to the algorithm design is
presented along with the privacy analysis. An analytical comparison of our
proposed DP-OPH methods with the DP minwise hashing (DP-MH) is provided to
justify the advantage of DP-OPH. Experiments on similarity search confirm the
merits of DP-OPH, and guide the choice of the proper variant in different
practical scenarios. Our technique is also extended to bin-wise consistent
weighted sampling (BCWS) to develop a new DP algorithm called DP-BCWS for
non-binary data. Experiments on classification tasks demonstrate that DP-BCWS
is able to achieve excellent utility at around $\epsilon = 5\sim 10$, where
$\epsilon$ is the standard parameter in the language of $(\epsilon,
\delta)$-DP.
Related papers
- Pb-Hash: Partitioned b-bit Hashing [21.607059258448594]
We propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $btimes m =B$.
Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy would drop.
In some regions, Pb-Hash still works well even for $m$ much larger than 4.
arXiv Detail & Related papers (2023-06-28T06:05:47Z) - A Lower Bound of Hash Codes' Performance [122.88252443695492]
In this paper, we prove that inter-class distinctiveness and intra-class compactness among hash codes determine the lower bound of hash codes' performance.
We then propose a surrogate model to fully exploit the above objective by estimating the posterior of hash codes and controlling it, which results in a low-bias optimization.
By testing on a series of hash-models, we obtain performance improvements among all of them, with an up to $26.5%$ increase in mean Average Precision and an up to $20.5%$ increase in accuracy.
arXiv Detail & Related papers (2022-10-12T03:30:56Z) - Learning to Hash Naturally Sorts [84.90210592082829]
We introduce Naturally-Sorted Hashing (NSH) to train a deep hashing model with sorted results end-to-end.
NSH sort the Hamming distances of samples' hash codes and accordingly gather their latent representations for self-supervised training.
We describe a novel Sorted Noise-Contrastive Estimation (SortedNCE) loss that selectively picks positive and negative samples for contrastive learning.
arXiv Detail & Related papers (2022-01-31T16:19:02Z) - C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with
Circulant Permutations [25.356048456005023]
We propose to incorporate the essential ideas of C-MinHash to improve the accuracy of One Permutation Hashing.
It can be shown that the estimation variance of the Jaccard similarity is strictly smaller than that of the existing (densified) OPH methods.
arXiv Detail & Related papers (2021-11-18T06:52:15Z) - C-MinHash: Rigorously Reducing $K$ Permutations to Two [25.356048456005023]
Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data.
We propose bf Circulant MinHash (C-MinHash) and provide the surprising theoretical results that we just need textbftwo independent random permutations.
arXiv Detail & Related papers (2021-09-07T21:06:33Z) - Unsupervised Multi-Index Semantic Hashing [23.169142004594434]
We propose an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing.
We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search.
We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH.
arXiv Detail & Related papers (2021-03-26T13:33:48Z) - CIMON: Towards High-quality Hash Codes [63.37321228830102]
We propose a new method named textbfComprehensive stextbfImilarity textbfMining and ctextbfOnsistency leartextbfNing (CIMON)
First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturb-invariant and discriminative hash codes.
arXiv Detail & Related papers (2020-10-15T14:47:14Z) - Deep Hashing with Hash-Consistent Large Margin Proxy Embeddings [65.36757931982469]
Image hash codes are produced by binarizing embeddings of convolutional neural networks (CNN) trained for either classification or retrieval.
The use of a fixed set of proxies (weights of the CNN classification layer) is proposed to eliminate this ambiguity.
The resulting hash-consistent large margin (HCLM) proxies are shown to encourage saturation of hashing units, thus guaranteeing a small binarization error.
arXiv Detail & Related papers (2020-07-27T23:47:43Z) - Reinforcing Short-Length Hashing [61.75883795807109]
Existing methods have poor performance in retrieval using an extremely short-length hash code.
In this study, we propose a novel reinforcing short-length hashing (RSLH)
In this proposed RSLH, mutual reconstruction between the hash representation and semantic labels is performed to preserve the semantic information.
Experiments on three large-scale image benchmarks demonstrate the superior performance of RSLH under various short-length hashing scenarios.
arXiv Detail & Related papers (2020-04-24T02:23:52Z) - Bio-Inspired Hashing for Unsupervised Similarity Search [12.78093617645299]
We propose a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner.
Our work provides evidence for the proposal that LSH might be a computational reason for the abundance of sparse expansive motifs in a variety of biological systems.
arXiv Detail & Related papers (2020-01-14T17:04:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.