C-MinHash: Practically Reducing Two Permutations to Just One
- URL: http://arxiv.org/abs/2109.04595v1
- Date: Fri, 10 Sep 2021 00:08:47 GMT
- Title: C-MinHash: Practically Reducing Two Permutations to Just One
- Authors: Xiaoyun Li and Ping Li
- Abstract summary: Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data.
Recent work on C-MinHash has shown that only two permutations are needed.
One single permutation is used for both the initial pre-processing step to break the structures in the data and the circulant hashing step to generate $K$ hashes.
- Score: 25.356048456005023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional minwise hashing (MinHash) requires applying $K$ independent
permutations to estimate the Jaccard similarity in massive binary (0/1) data,
where $K$ can be (e.g.,) 1024 or even larger, depending on applications. The
recent work on C-MinHash (Li and Li, 2021) has shown, with rigorous proofs,
that only two permutations are needed. An initial permutation is applied to
break whatever structures which might exist in the data, and a second
permutation is re-used $K$ times to produce $K$ hashes, via a circulant
shifting fashion. (Li and Li, 2021) has proved that, perhaps surprisingly, even
though the $K$ hashes are correlated, the estimation variance is strictly
smaller than the variance of the traditional MinHash.
It has been demonstrated in (Li and Li, 2021) that the initial permutation in
C-MinHash is indeed necessary. For the ease of theoretical analysis, they have
used two independent permutations. In this paper, we show that one can actually
simply use one permutation. That is, one single permutation is used for both
the initial pre-processing step to break the structures in the data and the
circulant hashing step to generate $K$ hashes. Although the theoretical
analysis becomes very complicated, we are able to explicitly write down the
expression for the expectation of the estimator. The new estimator is no longer
unbiased but the bias is extremely small and has essentially no impact on the
estimation accuracy (mean square errors). An extensive set of experiments are
provided to verify our claim for using just one permutation.
Related papers
- Quantum One-Wayness of the Single-Round Sponge with Invertible Permutations [49.1574468325115]
Sponge hashing is a widely used class of cryptographic hash algorithms.
Intrepid permutations have so far remained a fundamental open problem.
We show that finding zero-pairs in a random $2n$-bit permutation requires at least $Omega (2n/2)$ many queries.
arXiv Detail & Related papers (2024-03-07T18:46:58Z) - Pb-Hash: Partitioned b-bit Hashing [21.607059258448594]
We propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $btimes m =B$.
Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy would drop.
In some regions, Pb-Hash still works well even for $m$ much larger than 4.
arXiv Detail & Related papers (2023-06-28T06:05:47Z) - Differentially Private One Permutation Hashing and Bin-wise Consistent
Weighted Sampling [37.6593006747285]
Minwise hashing (MinHash) is a standard algorithm widely used in the industry for large-scale search and learning applications.
One permutation hashing (OPH) is an efficient alternative of MinHash which splits the data vectors into $K$ bins and generates hash values within each bin.
We propose DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH.
arXiv Detail & Related papers (2023-06-13T10:38:12Z) - Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z) - Private estimation algorithms for stochastic block models and mixture
models [63.07482515700984]
General tools for designing efficient private estimation algorithms.
First efficient $(epsilon, delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z) - A Permutation-Free Kernel Independence Test [36.50719125230106]
In nonparametric independence testing, we observe i.i.d. data $(X_i,Y_i)_i=1n$, where $X in mathcalX, Y in mathcalY$ lie in any general spaces.
Modern test statistics such as the kernel Hilbert-Schmidt Independence Criterion (HSIC) and Distance Covariance (dCov) have intractable null distributions due to the degeneracy of the underlying U-statistics.
This paper provides a simple but nontrivial modification of HSIC
arXiv Detail & Related papers (2022-12-18T15:28:16Z) - Quantum Resources Required to Block-Encode a Matrix of Classical Data [56.508135743727934]
We provide circuit-level implementations and resource estimates for several methods of block-encoding a dense $Ntimes N$ matrix of classical data to precision $epsilon$.
We examine resource tradeoffs between the different approaches and explore implementations of two separate models of quantum random access memory (QRAM)
Our results go beyond simple query complexity and provide a clear picture into the resource costs when large amounts of classical data are assumed to be accessible to quantum algorithms.
arXiv Detail & Related papers (2022-06-07T18:00:01Z) - C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with
Circulant Permutations [25.356048456005023]
We propose to incorporate the essential ideas of C-MinHash to improve the accuracy of One Permutation Hashing.
It can be shown that the estimation variance of the Jaccard similarity is strictly smaller than that of the existing (densified) OPH methods.
arXiv Detail & Related papers (2021-11-18T06:52:15Z) - C-MinHash: Rigorously Reducing $K$ Permutations to Two [25.356048456005023]
Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data.
We propose bf Circulant MinHash (C-MinHash) and provide the surprising theoretical results that we just need textbftwo independent random permutations.
arXiv Detail & Related papers (2021-09-07T21:06:33Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $widetildeO(ndk)$ for all $k = O(sqrtd) cup Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $widetildeO(ndk)$ for all $k$, at the expense of an $O(sqrtlog k)$ factor in the recovery guarantee
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - Robustly Learning any Clusterable Mixture of Gaussians [55.41573600814391]
We study the efficient learnability of high-dimensional Gaussian mixtures in the adversarial-robust setting.
We provide an algorithm that learns the components of an $epsilon$-corrupted $k$-mixture within information theoretically near-optimal error proofs of $tildeO(epsilon)$.
Our main technical contribution is a new robust identifiability proof clusters from a Gaussian mixture, which can be captured by the constant-degree Sum of Squares proof system.
arXiv Detail & Related papers (2020-05-13T16:44:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.