C-MinHash: Rigorously Reducing $K$ Permutations to Two
- URL: http://arxiv.org/abs/2109.03337v1
- Date: Tue, 7 Sep 2021 21:06:33 GMT
- Title: C-MinHash: Rigorously Reducing $K$ Permutations to Two
- Authors: Xiaoyun Li and Ping Li
- Abstract summary: Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data.
We propose Circulant MinHash (C-MinHash) and provide the surprising theoretical result that we need just two independent random permutations.
- Score: 25.356048456005023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Minwise hashing (MinHash) is an important and practical algorithm for
generating random hashes to approximate the Jaccard (resemblance) similarity in
massive binary (0/1) data. The basic theory of MinHash requires applying
hundreds or even thousands of independent random permutations to each data
vector in the dataset, in order to obtain reliable results for, e.g., building
large-scale learning models or approximate near neighbor search in massive
data. In this paper, we propose {\bf Circulant MinHash (C-MinHash)} and provide
the surprising theoretical result that we need just \textbf{two} independent
random permutations. For C-MinHash, we first conduct an initial permutation on
the data vector, then we use a second permutation to generate hash values.
Basically, the second permutation is re-used $K$ times via circulant shifting
to produce $K$ hashes. Unlike classical MinHash, these $K$ hashes are obviously
correlated, but we are able to provide rigorous proofs that we still obtain an
unbiased estimate of the Jaccard similarity and the theoretical variance is
uniformly smaller than that of the classical MinHash with $K$ independent
permutations. The theoretical proofs of C-MinHash require some non-trivial
efforts. Numerical experiments are conducted to justify the theory and
demonstrate the effectiveness of C-MinHash.
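To make the construction concrete, here is a minimal sketch of the two-permutation scheme described above. The function names, the direction of the circulant shift, and the NumPy-based implementation are illustrative assumptions, not the authors' reference code.

```python
import numpy as np

def c_minhash(nonzero_indices, D, K, seed=0):
    """Sketch of C-MinHash: K hash values from two permutations (sigma, pi).

    nonzero_indices: positions of the 1s in a binary vector of dimension D.
    Vectors that will be compared must be hashed with the same seed.
    """
    rng = np.random.default_rng(seed)
    sigma = rng.permutation(D)  # initial permutation: breaks structure in the data
    pi = rng.permutation(D)     # second permutation, re-used K times
    shuffled = sigma[np.asarray(nonzero_indices)]
    hashes = np.empty(K, dtype=np.int64)
    for k in range(K):
        # k-th circulant shift of pi, then the minimum over the shuffled nonzeros
        hashes[k] = pi[(shuffled + k) % D].min()
    return hashes

def jaccard_estimate(h1, h2):
    # Collision rate of the K correlated hashes; unbiased for the Jaccard similarity
    return float(np.mean(h1 == h2))

# Example: true Jaccard of {1, 5, 9} and {1, 5, 42} is 2/4 = 0.5
a = c_minhash([1, 5, 9], D=1000, K=512)
b = c_minhash([1, 5, 42], D=1000, K=512)
print(jaccard_estimate(a, b))  # should land near 0.5
```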
Related papers
- Pb-Hash: Partitioned b-bit Hashing [21.607059258448594]
We propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $b \times m = B$.
Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy drops.
In some regions, Pb-Hash still works well even for $m$ much larger than 4; a sketch of the chunking idea follows below.
arXiv Detail & Related papers (2023-06-28T06:05:47Z)
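As a rough, hypothetical illustration of the chunking idea (the helper name and the integer-hash representation are assumptions, not the paper's code), a $B$-bit hash value can be split into $m$ chunks of $b$ bits each:

```python
def partition_bits(h, b, m):
    """Split a (b*m)-bit hash value h into m chunks of b bits each."""
    mask = (1 << b) - 1
    return [(h >> (b * i)) & mask for i in range(m)]

# e.g. a 32-bit value into m = 4 chunks of b = 8 bits
print(partition_bits(0xDEADBEEF, b=8, m=4))  # [239, 190, 173, 222]
```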
- Differentially Private One Permutation Hashing and Bin-wise Consistent Weighted Sampling [37.6593006747285]
Minwise hashing (MinHash) is a standard algorithm widely used in the industry for large-scale search and learning applications.
One permutation hashing (OPH) is an efficient alternative to MinHash, which splits the data vector into $K$ bins and generates hash values within each bin.
We propose the DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re, and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH; a minimal OPH sketch follows below.
arXiv Detail & Related papers (2023-06-13T10:38:12Z)
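For context, here is a minimal sketch of the basic OPH binning step referred to above. Densification of empty bins, where the DP-OPH variants differ, is omitted, and the names and NumPy usage are illustrative assumptions:

```python
import numpy as np

def oph(nonzero_indices, D, K, seed=0):
    """One permutation hashing: permute once, split into K bins, take a min per bin."""
    assert D % K == 0, "assume K divides D for simplicity"
    rng = np.random.default_rng(seed)
    perm = rng.permutation(D)
    pos = perm[np.asarray(nonzero_indices)]  # permuted locations of the 1s
    bin_size = D // K
    hashes = np.full(K, -1)  # -1 marks an empty bin (to be densified)
    for p in pos:
        b, offset = divmod(int(p), bin_size)
        if hashes[b] == -1 or offset < hashes[b]:
            hashes[b] = offset  # keep the smallest offset seen in bin b
    return hashes
```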
- Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z)
- Private estimation algorithms for stochastic block models and mixture models [63.07482515700984]
We develop general tools for designing efficient private estimation algorithms, including the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z)
- A Lower Bound of Hash Codes' Performance [122.88252443695492]
In this paper, we prove that inter-class distinctiveness and intra-class compactness among hash codes determine the lower bound of hash codes' performance.
We then propose a surrogate model to fully exploit the above objective by estimating the posterior of hash codes and controlling it, which results in a low-bias optimization.
By testing on a series of hash models, we obtain performance improvements for all of them, with up to a 26.5% increase in mean Average Precision and up to a 20.5% increase in accuracy.
arXiv Detail & Related papers (2022-10-12T03:30:56Z)
- Learning to Hash Naturally Sorts [84.90210592082829]
We introduce Naturally-Sorted Hashing (NSH) to train a deep hashing model with sorted results end-to-end.
NSH sorts the Hamming distances of samples' hash codes and accordingly gathers their latent representations for self-supervised training.
We describe a novel Sorted Noise-Contrastive Estimation (SortedNCE) loss that selectively picks positive and negative samples for contrastive learning.
arXiv Detail & Related papers (2022-01-31T16:19:02Z)
- C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with Circulant Permutations [25.356048456005023]
We propose to incorporate the essential ideas of C-MinHash to improve the accuracy of One Permutation Hashing.
It can be shown that the estimation variance of the Jaccard similarity is strictly smaller than that of the existing (densified) OPH methods.
arXiv Detail & Related papers (2021-11-18T06:52:15Z)
- C-MinHash: Practically Reducing Two Permutations to Just One [25.356048456005023]
Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data.
Recent work on C-MinHash has shown that only two permutations are needed.
A single permutation is used for both the initial pre-processing step that breaks the structure in the data and the circulant hashing step that generates the $K$ hashes (a sketch follows below).
arXiv Detail & Related papers (2021-09-10T00:08:47Z)
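Under the same illustrative assumptions as the C-MinHash sketch near the top of this page, the one-permutation variant would look roughly like this (again a sketch, not the authors' code):

```python
import numpy as np

def c_minhash_one_perm(nonzero_indices, D, K, seed=0):
    """One permutation plays both roles: pre-processing shuffle and circulant hashing."""
    rng = np.random.default_rng(seed)
    pi = rng.permutation(D)                     # the single permutation
    shuffled = pi[np.asarray(nonzero_indices)]  # initial shuffle breaks data structure
    return np.array([pi[(shuffled + k) % D].min() for k in range(K)])
```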
- List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z)
- CIMON: Towards High-quality Hash Codes [63.37321228830102]
We propose a new method named Comprehensive sImilarity Mining and cOnsistency learNing (CIMON).
First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturb-invariant and discriminative hash codes.
arXiv Detail & Related papers (2020-10-15T14:47:14Z)
- DartMinHash: Fast Sketching for Weighted Sets [0.5584060970507505]
Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines.
We introduce a simple algorithm that takes a weighted set $x \in \mathbb{R}_{\geq 0}^{d}$ and computes $k$ independent minhashes in expected time.
arXiv Detail & Related papers (2020-05-23T14:59:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.