C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with
Circulant Permutations
- URL: http://arxiv.org/abs/2111.09544v1
- Date: Thu, 18 Nov 2021 06:52:15 GMT
- Title: C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with
Circulant Permutations
- Authors: Xiaoyun Li and Ping Li
- Abstract summary: We propose to incorporate the essential ideas of C-MinHash to improve the accuracy of One Permutation Hashing.
It can be shown that the estimation variance of the Jaccard similarity is strictly smaller than that of the existing (densified) OPH methods.
- Score: 25.356048456005023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Minwise hashing (MinHash) is a classical method for efficiently estimating
the Jaccard similarity in massive binary (0/1) data. To generate $K$ hash
values for each data vector, the standard theory of MinHash requires $K$
independent permutations. Interestingly, the recent work on "circulant MinHash"
(C-MinHash) has shown that merely two permutations are needed. The first
permutation breaks the structure of the data and the second permutation is
re-used $K$ times in a circulant manner. Surprisingly, the estimation variance
of C-MinHash is proved to be strictly smaller than that of the original
MinHash. The more recent work further demonstrates that practically only one
permutation is needed. Note that C-MinHash is different from the well-known
work on "One Permutation Hashing (OPH)" published in NIPS'12. OPH and its
variants using different "densification" schemes are popular alternatives to
the standard MinHash. The densification step is necessary in order to deal with
empty bins which exist in One Permutation Hashing.
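To make the contrast concrete, here is a minimal sketch (not the authors' implementation; function names and toy parameters are our own) of standard MinHash versus One Permutation Hashing. It shows why OPH is cheaper, and why empty bins arise when a data vector has fewer nonzeros than bins:

```python
import random

def minhash(data, K, D, seed=0):
    """Standard MinHash: K independent permutations of [0, D), K hash values."""
    rng = random.Random(seed)
    hashes = []
    for _ in range(K):
        perm = list(range(D))
        rng.shuffle(perm)
        # hash value = smallest permuted index among the set's elements
        hashes.append(min(perm[i] for i in data))
    return hashes

def oph(data, K, D, seed=0):
    """One Permutation Hashing: ONE permutation, split into K bins of size D/K.
    Bins that receive no nonzero element are left empty (None)."""
    rng = random.Random(seed)
    perm = list(range(D))
    rng.shuffle(perm)
    bin_size = D // K                  # assumes K divides D for simplicity
    bins = [None] * K
    for i in data:
        p = perm[i]
        b = p // bin_size              # which bin the permuted index falls in
        v = p % bin_size               # location within that bin
        if bins[b] is None or v < bins[b]:
            bins[b] = v
    return bins                        # empty bins require "densification"

def jaccard_estimate(h1, h2):
    """Unbiased MinHash estimator: fraction of matching hash values."""
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)
```

With only 3 nonzeros and 4 bins, `oph` necessarily leaves at least one bin empty, which is exactly the case the densification schemes discussed in the paper must handle.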
In this paper, we propose to incorporate the essential ideas of C-MinHash to
improve the accuracy of One Permutation Hashing. Basically, we develop a new
densification method for OPH, which achieves the smallest estimation variance
compared to all existing densification schemes for OPH. Our proposed method is
named C-OPH (Circulant OPH). After the initial permutation (which breaks the
existing structure of the data), C-OPH only needs a "shorter" permutation of
length $D/K$ (instead of $D$), where $D$ is the original data dimension and $K$
is the total number of bins in OPH. This short permutation is re-used in $K$
bins in a circulant shifting manner. It can be shown that the estimation
variance of the Jaccard similarity is strictly smaller than that of the
existing (densified) OPH methods.
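The circulant re-use described above can be sketched as follows. This is a hypothetical reading of the abstract, not the paper's algorithm: after the initial length-$D$ permutation, a single short permutation of length $d = D/K$ is applied within each bin, shifted circulantly by the bin index (the shift schedule and the densification rule are assumptions here):

```python
import random

def c_oph_sketch(data, K, D, seed=0):
    """Sketch of C-OPH's hashing step (illustrative, per our reading):
    one initial length-D permutation breaks the data structure, then one
    short permutation of length d = D/K is re-used across all K bins,
    circulantly shifted by k in bin k."""
    rng = random.Random(seed)
    d = D // K                               # short permutation length
    init = list(range(D)); rng.shuffle(init)     # initial full permutation
    sigma = list(range(d)); rng.shuffle(sigma)   # short permutation, re-used
    bins = [None] * K
    for i in data:
        p = init[i]
        k, loc = divmod(p, d)                # bin index, within-bin location
        v = sigma[(loc + k) % d]             # circulant shift by k in bin k
        if bins[k] is None or v < bins[k]:
            bins[k] = v
    return bins    # empty bins still need the paper's densification step
```

Note the memory saving: only one length-$D$ and one length-$D/K$ permutation are stored, versus the $K$ length-$D$ permutations of standard MinHash.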
Related papers
- HashAttention: Semantic Sparsity for Faster Inference [91.54218318798603]
HashAttention is a principled approach casting pivotal token identification as a recommendation problem.
It efficiently identifies pivotal tokens for a given query in this Hamming space using bitwise operations.
It can reduce the number of tokens used by a factor of $1/32$ for the Llama-3.1-8B model with LongBench.
arXiv Detail & Related papers (2024-12-19T02:34:15Z)
- Pb-Hash: Partitioned b-bit Hashing [21.607059258448594]
We propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $b \times m = B$.
Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy would drop.
In some regions, Pb-Hash still works well even for $m$ much larger than 4.
arXiv Detail & Related papers (2023-06-28T06:05:47Z)
- Differentially Private One Permutation Hashing and Bin-wise Consistent Weighted Sampling [37.6593006747285]
Minwise hashing (MinHash) is a standard algorithm widely used in the industry for large-scale search and learning applications.
One permutation hashing (OPH) is an efficient alternative of MinHash which splits the data vectors into $K$ bins and generates hash values within each bin.
We propose DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH.
arXiv Detail & Related papers (2023-06-13T10:38:12Z)
- Private estimation algorithms for stochastic block models and mixture models [63.07482515700984]
General tools for designing efficient private estimation algorithms.
First efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z)
- A Lower Bound of Hash Codes' Performance [122.88252443695492]
In this paper, we prove that inter-class distinctiveness and intra-class compactness among hash codes determine the lower bound of hash codes' performance.
We then propose a surrogate model to fully exploit the above objective by estimating the posterior of hash codes and controlling it, which results in a low-bias optimization.
By testing on a series of hash-models, we obtain performance improvements among all of them, with an up to $26.5\%$ increase in mean Average Precision and an up to $20.5\%$ increase in accuracy.
arXiv Detail & Related papers (2022-10-12T03:30:56Z)
- Learning to Hash Naturally Sorts [84.90210592082829]
We introduce Naturally-Sorted Hashing (NSH) to train a deep hashing model with sorted results end-to-end.
NSH sorts the Hamming distances of samples' hash codes and accordingly gathers their latent representations for self-supervised training.
We describe a novel Sorted Noise-Contrastive Estimation (SortedNCE) loss that selectively picks positive and negative samples for contrastive learning.
arXiv Detail & Related papers (2022-01-31T16:19:02Z)
- C-MinHash: Practically Reducing Two Permutations to Just One [25.356048456005023]
Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data.
Recent work on C-MinHash has shown that only two permutations are needed.
One single permutation is used for both the initial pre-processing step to break the structures in the data and the circulant hashing step to generate $K$ hashes.
arXiv Detail & Related papers (2021-09-10T00:08:47Z)
- C-MinHash: Rigorously Reducing $K$ Permutations to Two [25.356048456005023]
Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data.
We propose Circulant MinHash (C-MinHash) and provide the surprising theoretical result that just two independent random permutations are needed.
arXiv Detail & Related papers (2021-09-07T21:06:33Z)
- Procrustean Orthogonal Sparse Hashing [3.302605292858623]
We show that insect olfaction is structurally and functionally analogous to sparse hashing.
We present a novel method, Procrustean Orthogonal Sparse Hashing (POSH), that unifies these findings.
We propose two new methods, Binary OSL and SphericalHash, to address these deficiencies.
arXiv Detail & Related papers (2020-06-08T18:09:33Z)
- Reinforcing Short-Length Hashing [61.75883795807109]
Existing methods have poor performance in retrieval using an extremely short-length hash code.
In this study, we propose a novel reinforcing short-length hashing (RSLH) method.
In the proposed RSLH, mutual reconstruction between the hash representation and semantic labels is performed to preserve the semantic information.
Experiments on three large-scale image benchmarks demonstrate the superior performance of RSLH under various short-length hashing scenarios.
arXiv Detail & Related papers (2020-04-24T02:23:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.