C-MinHash: Rigorously Reducing $K$ Permutations to Two
- URL: http://arxiv.org/abs/2109.03337v1
- Date: Tue, 7 Sep 2021 21:06:33 GMT
- Title: C-MinHash: Rigorously Reducing $K$ Permutations to Two
- Authors: Xiaoyun Li and Ping Li
- Abstract summary: Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data.
We propose Circulant MinHash (C-MinHash) and provide the surprising theoretical result that we need just two independent random permutations.
- Score: 25.356048456005023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Minwise hashing (MinHash) is an important and practical algorithm for
generating random hashes to approximate the Jaccard (resemblance) similarity in
massive binary (0/1) data. The basic theory of MinHash requires applying
hundreds or even thousands of independent random permutations to each data
vector in the dataset, in order to obtain reliable results for, e.g., building
large-scale learning models or approximate near neighbor search in massive
data. In this paper, we propose {\bf Circulant MinHash (C-MinHash)} and provide
the surprising theoretical result that we need just \textbf{two} independent
random permutations. For C-MinHash, we first conduct an initial permutation on
the data vector, then we use a second permutation to generate hash values.
Basically, the second permutation is re-used $K$ times via circulant shifting
to produce $K$ hashes. Unlike classical MinHash, these $K$ hashes are obviously
correlated, but we are able to provide rigorous proofs that we still obtain an
unbiased estimate of the Jaccard similarity and the theoretical variance is
uniformly smaller than that of the classical MinHash with $K$ independent
permutations. The theoretical proofs of C-MinHash require some non-trivial
efforts. Numerical experiments are conducted to justify the theory and
demonstrate the effectiveness of C-MinHash.
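To make the construction concrete, here is a minimal sketch of the two-permutation scheme described above. The function names, the direction of the circulant shift, and the NumPy-based implementation are illustrative assumptions, not the authors' reference code.

```python
import numpy as np

def c_minhash(nonzero_indices, D, K, seed=0):
    """Sketch of C-MinHash: K hash values from two permutations (sigma, pi).

    nonzero_indices: positions of the 1s in a binary vector of dimension D.
    Vectors that will be compared must be hashed with the same seed.
    """
    rng = np.random.default_rng(seed)
    sigma = rng.permutation(D)  # initial permutation: breaks structure in the data
    pi = rng.permutation(D)     # second permutation, re-used K times
    shuffled = sigma[np.asarray(nonzero_indices)]
    hashes = np.empty(K, dtype=np.int64)
    for k in range(K):
        # k-th circulant shift of pi, then the minimum over the shuffled nonzeros
        hashes[k] = pi[(shuffled + k) % D].min()
    return hashes

def jaccard_estimate(h1, h2):
    # Collision rate of the K correlated hashes; unbiased for the Jaccard similarity
    return float(np.mean(h1 == h2))

# Example: true Jaccard of {1, 5, 9} and {1, 5, 42} is 2/4 = 0.5
a = c_minhash([1, 5, 9], D=1000, K=512)
b = c_minhash([1, 5, 42], D=1000, K=512)
print(jaccard_estimate(a, b))  # should land near 0.5
```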
Related papers
- Pb-Hash: Partitioned b-bit Hashing [21.607059258448594]
We propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $b \times m = B$.
Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy drops.
In some regions, Pb-Hash still works well even for $m$ much larger than 4; a sketch of the chunking idea follows below.
arXiv Detail & Related papers (2023-06-28T06:05:47Z)
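As a rough, hypothetical illustration of the chunking idea (the helper name and the integer-hash representation are assumptions, not the paper's code), a $B$-bit hash value can be split into $m$ chunks of $b$ bits each:

```python
def partition_bits(h, b, m):
    """Split a (b*m)-bit hash value h into m chunks of b bits each."""
    mask = (1 << b) - 1
    return [(h >> (b * i)) & mask for i in range(m)]

# e.g. a 32-bit value into m = 4 chunks of b = 8 bits
print(partition_bits(0xDEADBEEF, b=8, m=4))  # [239, 190, 173, 222]
```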
- Differentially Private One Permutation Hashing and Bin-wise Consistent Weighted Sampling [37.6593006747285]
Minwise hashing (MinHash) is a standard algorithm widely used in the industry for large-scale search and learning applications.
One permutation hashing (OPH) is an efficient alternative to MinHash, which splits the data vector into $K$ bins and generates hash values within each bin.
We propose the DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re, and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH; a minimal OPH sketch follows below.
arXiv Detail & Related papers (2023-06-13T10:38:12Z)
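For context, here is a minimal sketch of the basic OPH binning step referred to above. Densification of empty bins, where the DP-OPH variants differ, is omitted, and the names and NumPy usage are illustrative assumptions:

```python
import numpy as np

def oph(nonzero_indices, D, K, seed=0):
    """One permutation hashing: permute once, split into K bins, take a min per bin."""
    assert D % K == 0, "assume K divides D for simplicity"
    rng = np.random.default_rng(seed)
    perm = rng.permutation(D)
    pos = perm[np.asarray(nonzero_indices)]  # permuted locations of the 1s
    bin_size = D // K
    hashes = np.full(K, -1)  # -1 marks an empty bin (to be densified)
    for p in pos:
        b, offset = divmod(int(p), bin_size)
        if hashes[b] == -1 or offset < hashes[b]:
            hashes[b] = offset  # keep the smallest offset seen in bin b
    return hashes
```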
- Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z)
- Private estimation algorithms for stochastic block models and mixture models [63.07482515700984]
We develop general tools for designing efficient private estimation algorithms, including the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z)
- A Lower Bound of Hash Codes' Performance [122.88252443695492]
In this paper, we prove that inter-class distinctiveness and intra-class compactness among hash codes determine the lower bound of hash codes' performance.
We then propose a surrogate model to fully exploit the above objective by estimating the posterior of hash codes and controlling it, which results in a low-bias optimization.
By testing on a series of hash models, we obtain performance improvements for all of them, with up to a 26.5% increase in mean Average Precision and up to a 20.5% increase in accuracy.
arXiv Detail & Related papers (2022-10-12T03:30:56Z)
- Learning to Hash Naturally Sorts [84.90210592082829]
We introduce Naturally-Sorted Hashing (NSH) to train a deep hashing model with sorted results end-to-end.
NSH sorts the Hamming distances of samples' hash codes and accordingly gathers their latent representations for self-supervised training.
We describe a novel Sorted Noise-Contrastive Estimation (SortedNCE) loss that selectively picks positive and negative samples for contrastive learning.
arXiv Detail & Related papers (2022-01-31T16:19:02Z)
- C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with Circulant Permutations [25.356048456005023]
We propose to incorporate the essential ideas of C-MinHash to improve the accuracy of One Permutation Hashing.
It can be shown that the estimation variance of the Jaccard similarity is strictly smaller than that of the existing (densified) OPH methods.
arXiv Detail & Related papers (2021-11-18T06:52:15Z)
- C-MinHash: Practically Reducing Two Permutations to Just One [25.356048456005023]
Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data.
Recent work on C-MinHash has shown that only two permutations are needed.
A single permutation is used for both the initial pre-processing step that breaks the structure in the data and the circulant hashing step that generates the $K$ hashes (a sketch follows below).
arXiv Detail & Related papers (2021-09-10T00:08:47Z)
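Under the same illustrative assumptions as the C-MinHash sketch near the top of this page, the one-permutation variant would look roughly like this (again a sketch, not the authors' code):

```python
import numpy as np

def c_minhash_one_perm(nonzero_indices, D, K, seed=0):
    """One permutation plays both roles: pre-processing shuffle and circulant hashing."""
    rng = np.random.default_rng(seed)
    pi = rng.permutation(D)                     # the single permutation
    shuffled = pi[np.asarray(nonzero_indices)]  # initial shuffle breaks data structure
    return np.array([pi[(shuffled + k) % D].min() for k in range(K)])
```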
- List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z)
- CIMON: Towards High-quality Hash Codes [63.37321228830102]
We propose a new method named Comprehensive sImilarity Mining and cOnsistency learNing (CIMON).
First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturb-invariant and discriminative hash codes.
arXiv Detail & Related papers (2020-10-15T14:47:14Z)
- DartMinHash: Fast Sketching for Weighted Sets [0.5584060970507505]
Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines.
We introduce a simple algorithm that takes a weighted set $x \in \mathbb{R}_{\geq 0}^{d}$ and computes $k$ independent minhashes in expected time.
arXiv Detail & Related papers (2020-05-23T14:59:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.