Minwise-Independent Permutations with Insertion and Deletion of Features
- URL: http://arxiv.org/abs/2308.11240v1
- Date: Tue, 22 Aug 2023 07:27:45 GMT
- Title: Minwise-Independent Permutations with Insertion and Deletion of Features
- Authors: Rameshwar Pratap and Raghav Kulkarni
- Abstract summary: Broder textitet. al.citepBroderCFM98 introduces the $mathrmminHash$ algorithm that computes a low-dimensional sketch of binary data.
We show a rigorous theoretical analysis of our algorithms and complement it with extensive experiments on several real-world datasets.
- Score: 0.07252027234425332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In their seminal work, Broder \textit{et. al.}~\citep{BroderCFM98} introduces
the $\mathrm{minHash}$ algorithm that computes a low-dimensional sketch of
high-dimensional binary data that closely approximates pairwise Jaccard
similarity. Since its invention, $\mathrm{minHash}$ has been commonly used by
practitioners in various big data applications. Further, the data is dynamic in
many real-life scenarios, and their feature sets evolve over time. We consider
the case when features are dynamically inserted and deleted in the dataset. We
note that a naive solution to this problem is to repeatedly recompute
$\mathrm{minHash}$ with respect to the updated dimension. However, this is an
expensive task as it requires generating fresh random permutations. To the best
of our knowledge, no systematic study of $\mathrm{minHash}$ is recorded in the
context of dynamic insertion and deletion of features. In this work, we
initiate this study and suggest algorithms that make the $\mathrm{minHash}$
sketches adaptable to the dynamic insertion and deletion of features. We show a
rigorous theoretical analysis of our algorithms and complement it with
extensive experiments on several real-world datasets. Empirically we observe a
significant speed-up in the running time while simultaneously offering
comparable performance with respect to running $\mathrm{minHash}$ from scratch.
Our proposal is efficient, accurate, and easy to implement in practice.
Related papers
- Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
We revisit the input perturbations framework for differential privacy where noise is added to the input $Ain mathcalS$.
We first design novel efficient algorithms to privately release pair-wise cosine similarities.
We derive a novel algorithm to compute $k$-way marginal queries over $n$ features.
arXiv Detail & Related papers (2024-06-07T12:07:16Z) - Provably Efficient High-Dimensional Bandit Learning with Batched
Feedbacks [93.00280593719513]
We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches.
In specific, each batch collects data according to a policy that depends on previous batches and the rewards are revealed only at the end of the batch.
Our algorithm achieves regret bounds comparable to those in fully sequential setting with only $mathcalO( log T)$ batches.
arXiv Detail & Related papers (2023-11-22T06:06:54Z) - Pb-Hash: Partitioned b-bit Hashing [21.607059258448594]
We propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $btimes m =B$.
Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy would drop.
In some regions, Pb-Hash still works well even for $m$ much larger than 4.
arXiv Detail & Related papers (2023-06-28T06:05:47Z) - Adaptive and Dynamic Multi-Resolution Hashing for Pairwise Summations [19.602149096819776]
We propose Adam-Hash: an adaptive and dynamic multi-resolution hashing data-structure for fast pairwise summation estimation.
Our proposed Adam-Hash is also robust to adaptive PSE queries, where an adversary can choose query $q_j in mathbbRd$ depending on the output from previous queries.
arXiv Detail & Related papers (2022-12-21T23:23:24Z) - Asymmetric Scalable Cross-modal Hashing [51.309905690367835]
Cross-modal hashing is a successful method to solve large-scale multimedia retrieval issue.
We propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) to address these issues.
Our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2022-07-26T04:38:47Z) - Scalable Differentially Private Clustering via Hierarchically Separated
Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d3/2log n)cdot OPT + O(k d2 log2 n / epsilon2)$, where $epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z) - Nearly Minimax Algorithms for Linear Bandits with Shared Representation [86.79657561369397]
We consider the setting where we play $M$ linear bandits with dimension $d$, each for $T$ rounds, and these $M$ bandit tasks share a common $k(ll d)$ dimensional linear representation.
We come up with novel algorithms that achieve $widetildeOleft(dsqrtkMT + kMsqrtTright)$ regret bounds, which matches the known minimax regret lower bound up to logarithmic factors.
arXiv Detail & Related papers (2022-03-29T15:27:13Z) - C-MinHash: Rigorously Reducing $K$ Permutations to Two [25.356048456005023]
Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data.
We propose bf Circulant MinHash (C-MinHash) and provide the surprising theoretical results that we just need textbftwo independent random permutations.
arXiv Detail & Related papers (2021-09-07T21:06:33Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $widetildeO(ndk)$ for all $k = O(sqrtd) cup Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $widetildeO(ndk)$ for all $k$, at the expense of an $O(sqrtlog k)$ factor in the recovery guarantee
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - An Efficient $k$-modes Algorithm for Clustering Categorical Datasets [8.528384027684194]
We provide a novel, computationally efficient implementation of $k$-modes, called OTQT.
We prove that OTQT finds updates to improve the objective function that are undetectable to existing $k$-modes algorithms.
OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum.
arXiv Detail & Related papers (2020-06-06T18:41:36Z) - DartMinHash: Fast Sketching for Weighted Sets [0.5584060970507505]
Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines.
We introduce a simple algorithm that takes a weighted set $x in mathbbR_geq 0d$ and computes $k$ independent minhashes in expected time.
arXiv Detail & Related papers (2020-05-23T14:59:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.