Pb-Hash: Partitioned b-bit Hashing
- URL: http://arxiv.org/abs/2306.15944v1
- Date: Wed, 28 Jun 2023 06:05:47 GMT
- Title: Pb-Hash: Partitioned b-bit Hashing
- Authors: Ping Li, Weijie Zhao
- Abstract summary: We propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $b\times m = B$.
Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy would drop.
In some regions, Pb-Hash still works well even for $m$ much larger than 4.
- Score: 21.607059258448594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many hashing algorithms including minwise hashing (MinHash), one permutation
hashing (OPH), and consistent weighted sampling (CWS) generate integers of $B$
bits. With $k$ hashes for each data vector, the storage would be $B\times k$
bits; and when used for large-scale learning, the model size would be
$2^B\times k$, which can be expensive. A standard strategy is to use only the
lowest $b$ bits out of the $B$ bits and somewhat increase $k$, the number of
hashes. In this study, we propose to re-use the hashes by partitioning the $B$
bits into $m$ chunks, e.g., $b\times m =B$. Correspondingly, the model size
becomes $m\times 2^b \times k$, which can be substantially smaller than the
original $2^B\times k$.
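As a concrete illustration of the partitioning (a minimal sketch with assumed parameters, not code from the paper), the snippet below extracts the lowest $b$ bits of a $B$-bit hash, as in the standard b-bit strategy, and then the $m$ chunks of $b$ bits re-used by Pb-Hash, along with the resulting per-hash model sizes:

```python
# Minimal sketch (assumed parameters, not the paper's code): the standard
# strategy keeps only the lowest b of the B bits, whereas Pb-Hash re-uses
# all B bits as m = B / b chunks of b bits each.
B, b = 32, 8
m = B // b                                  # here m = 4, so b * m = B

def lowest_b_bits(h: int) -> int:
    """Standard b-bit strategy: keep only the lowest b of the B bits."""
    return h & ((1 << b) - 1)

def pb_hash_chunks(h: int) -> list:
    """Pb-Hash: partition the B bits into m chunks of b bits."""
    mask = (1 << b) - 1
    return [(h >> (j * b)) & mask for j in range(m)]

h = 0xDEADBEEF                              # an example 32-bit hash value
print(hex(lowest_b_bits(h)))                # 0xef
print([hex(c) for c in pb_hash_chunks(h)])  # ['0xef', '0xbe', '0xad', '0xde']

# Per hash, the model size shrinks from 2^B weights to m * 2^b weights:
print(2 ** B, "vs", m * 2 ** b)             # 4294967296 vs 1024
```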
Our theoretical analysis reveals that by partitioning the hash values into
$m$ chunks, the accuracy would drop. In other words, using $m$ chunks of $B/m$
bits would not be as accurate as directly using $B$ bits. This is due to the
correlation from re-using the same hash. On the other hand, our analysis also
shows that the accuracy would not drop much for small $m$ (e.g., $m=2\sim 4$). In some
regions, Pb-Hash still works well even for $m$ much larger than 4. We expect
Pb-Hash would be a good addition to the family of hashing methods/applications
and benefit industrial practitioners.
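The effect of that correlation can be seen in a small simulation. The following is a hypothetical Monte Carlo sketch (not the paper's experiment, and using a simplified collision model): it draws $k$ MinHash values for a pair with known Jaccard similarity $J$, then compares estimating $J$ from the full $B$-bit hashes against re-using each hash as $m$ chunks of $b$ bits, with a simple large-universe debiasing of the per-chunk match rate. For moderate $m$ and $b$, the two RMSEs come out very close, in line with the claim above that the accuracy drop is small.

```python
# Hypothetical Monte Carlo sketch (not the paper's experiment): each MinHash
# pair collides with probability J; otherwise the two values are independent
# uniform B-bit integers. We estimate J from the full hashes and from Pb-Hash
# style chunks of the same hashes, and compare the RMSE over many trials.
import numpy as np

rng = np.random.default_rng(0)
B, b, m = 32, 8, 4            # b * m = B
k = 256                       # hashes per data vector
J = 0.3                       # true Jaccard similarity
trials = 2000
mask = np.uint64((1 << b) - 1)

err_full, err_pb = [], []
for _ in range(trials):
    collide = rng.random(k) < J
    h1 = rng.integers(0, 1 << B, size=k, dtype=np.uint64)
    h2 = np.where(collide, h1, rng.integers(0, 1 << B, size=k, dtype=np.uint64))

    # Estimator 1: match rate of the full B-bit hashes.
    J_full = np.mean(h1 == h2)

    # Estimator 2 (Pb-Hash): per-chunk match rates, debiased with
    # P(chunk match) = J + (1 - J) / 2^b under this collision model,
    # then averaged over the m (correlated) chunks.
    ests = []
    for j in range(m):
        shift = np.uint64(j * b)
        p = np.mean(((h1 >> shift) & mask) == ((h2 >> shift) & mask))
        ests.append((p - 2.0 ** -b) / (1.0 - 2.0 ** -b))
    J_pb = np.mean(ests)

    err_full.append((J_full - J) ** 2)
    err_pb.append((J_pb - J) ** 2)

print("RMSE, full B-bit hashes :", np.sqrt(np.mean(err_full)))
print("RMSE, Pb-Hash (m chunks):", np.sqrt(np.mean(err_pb)))
```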
We verify the effectiveness of Pb-Hash in machine learning tasks, for linear
SVM models as well as deep learning models. Since the hashed data are
essentially categorical (ID) features, we follow the standard practice of using
embedding tables for each hash. With Pb-Hash, we need to design an effective
strategy to combine $m$ embeddings. Our study provides an empirical evaluation
on four pooling schemes: concatenation, max pooling, mean pooling, and product
pooling. There is no definite answer as to which pooling scheme is always better, and
we leave that for future study.
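As an illustration of the pooling question, here is a hypothetical PyTorch sketch (assumed module and parameter names, not the paper's implementation): each hash is split into $m$ chunks of $b$ bits, each chunk indexes its own embedding table of size $2^b$, and the $m$ embeddings are combined with one of the four pooling schemes above.

```python
# Hypothetical PyTorch sketch (not the paper's implementation): m embedding
# tables of size 2^b, one per chunk of a B = b * m bit hash, combined by
# concatenation, max, mean, or product pooling.
import torch
import torch.nn as nn

class PbHashEmbedding(nn.Module):
    def __init__(self, b: int, m: int, dim: int, pooling: str = "concat"):
        super().__init__()
        self.b, self.m, self.pooling = b, m, pooling
        # One table of 2^b rows per chunk, instead of one table of 2^B rows.
        self.tables = nn.ModuleList([nn.Embedding(2 ** b, dim) for _ in range(m)])

    def forward(self, hashes: torch.Tensor) -> torch.Tensor:
        # hashes: (batch,) integer hash values of B = b * m bits.
        mask = (1 << self.b) - 1
        chunks = [(hashes >> (j * self.b)) & mask for j in range(self.m)]
        embs = torch.stack(
            [table(chunk) for table, chunk in zip(self.tables, chunks)], dim=1
        )                                        # (batch, m, dim)
        if self.pooling == "concat":
            return embs.flatten(start_dim=1)     # (batch, m * dim)
        if self.pooling == "max":
            return embs.max(dim=1).values        # (batch, dim)
        if self.pooling == "mean":
            return embs.mean(dim=1)              # (batch, dim)
        if self.pooling == "prod":
            return embs.prod(dim=1)              # (batch, dim)
        raise ValueError(f"unknown pooling: {self.pooling}")

# Usage: one such module per hash; with k hashes the embedding parameters are
# roughly k * m * 2^b * dim instead of k * 2^B * dim.
layer = PbHashEmbedding(b=8, m=4, dim=16, pooling="mean")
out = layer(torch.randint(0, 2 ** 32, (5,)))     # shape (5, 16)
```

Note that concatenation yields an output $m$ times wider than the embedding dimension, while max, mean, and product pooling keep the output dimension fixed.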
Related papers
- HashAttention: Semantic Sparsity for Faster Inference [91.54218318798603]
HashAttention is a principled approach casting pivotal token identification as a recommendation problem.
It efficiently identifies pivotal tokens for a given query in Hamming space using bitwise operations.
It can reduce the number of tokens used by a factor of $1/32\times$ for the Llama-3.1-8B model with LongBench.
arXiv Detail & Related papers (2024-12-19T02:34:15Z)
- Minwise-Independent Permutations with Insertion and Deletion of Features [0.07252027234425332]
Broder et al. [BroderCFM98] introduce the $\mathrm{minHash}$ algorithm that computes a low-dimensional sketch of binary data.
We provide a rigorous theoretical analysis of our algorithms and complement it with extensive experiments on several real-world datasets.
arXiv Detail & Related papers (2023-08-22T07:27:45Z)
- Differentially Private One Permutation Hashing and Bin-wise Consistent Weighted Sampling [37.6593006747285]
Minwise hashing (MinHash) is a standard algorithm widely used in the industry for large-scale search and learning applications.
One permutation hashing (OPH) is an efficient alternative to MinHash; it splits the data vectors into $K$ bins and generates hash values within each bin.
We propose the DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re, and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH.
arXiv Detail & Related papers (2023-06-13T10:38:12Z)
- A Lower Bound of Hash Codes' Performance [122.88252443695492]
In this paper, we prove that inter-class distinctiveness and intra-class compactness among hash codes determine the lower bound of hash codes' performance.
We then propose a surrogate model to fully exploit the above objective by estimating the posterior of hash codes and controlling it, which results in a low-bias optimization.
By testing on a series of hash-models, we obtain performance improvements among all of them, with an up to $26.5\%$ increase in mean Average Precision and an up to $20.5\%$ increase in accuracy.
arXiv Detail & Related papers (2022-10-12T03:30:56Z)
- One Loss for All: Deep Hashing with a Single Cosine Similarity based Learning Objective [86.48094395282546]
A deep hashing model typically has two main learning objectives: to make the learned binary hash codes discriminative and to minimize a quantization error.
We propose a novel deep hashing model with only a single learning objective.
Our model is highly effective, outperforming the state-of-the-art multi-loss hashing models on three large-scale instance retrieval benchmarks.
arXiv Detail & Related papers (2021-09-29T14:27:51Z)
- C-MinHash: Practically Reducing Two Permutations to Just One [25.356048456005023]
Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data.
Recent work on C-MinHash has shown that only two permutations are needed.
One single permutation is used for both the initial pre-processing step to break the structures in the data and the circulant hashing step to generate $K$ hashes.
arXiv Detail & Related papers (2021-09-10T00:08:47Z)
- C-MinHash: Rigorously Reducing $K$ Permutations to Two [25.356048456005023]
Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data.
We propose Circulant MinHash (C-MinHash) and provide the surprising theoretical result that we need just two independent random permutations.
arXiv Detail & Related papers (2021-09-07T21:06:33Z)
- Under-bagging Nearest Neighbors for Imbalanced Classification [63.026765294759876]
We propose an ensemble learning algorithm called under-bagging $k$-NN for imbalanced classification problems.
arXiv Detail & Related papers (2021-09-01T14:10:38Z)
- List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z)
- A Randomized Algorithm to Reduce the Support of Discrete Measures [79.55586575988292]
Given a discrete probability measure supported on $N$ atoms and a set of $n$ real-valued functions, there exists a probability measure that is supported on a subset of $n+1$ of the original $N$ atoms.
We give a simple geometric characterization of barycenters via negative cones and derive a randomized algorithm that computes this new measure by "greedy geometric sampling".
We then study its properties, and benchmark it on synthetic and real-world data to show that it can be very beneficial in the $N \gg n$ regime.
arXiv Detail & Related papers (2020-06-02T16:38:36Z)
- DartMinHash: Fast Sketching for Weighted Sets [0.5584060970507505]
Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines.
We introduce a simple algorithm that takes a weighted set $x \in \mathbb{R}_{\geq 0}^{d}$ and computes $k$ independent minhashes in expected time.
arXiv Detail & Related papers (2020-05-23T14:59:25Z)