DartMinHash: Fast Sketching for Weighted Sets
- URL: http://arxiv.org/abs/2005.11547v1
- Date: Sat, 23 May 2020 14:59:25 GMT
- Title: DartMinHash: Fast Sketching for Weighted Sets
- Authors: Tobias Christiani
- Abstract summary: Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines.
We introduce a simple algorithm that takes a weighted set $x in mathbbR_geq 0d$ and computes $k$ independent minhashes in expected time.
- Score: 0.5584060970507505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weighted minwise hashing is a standard dimensionality reduction technique
with applications to similarity search and large-scale kernel machines. We
introduce a simple algorithm that takes a weighted set $x \in \mathbb{R}_{\geq
0}^{d}$ and computes $k$ independent minhashes in expected time $O(k \log k +
\Vert x \Vert_{0}\log( \Vert x \Vert_1 + 1/\Vert x \Vert_1))$, improving upon
the state-of-the-art BagMinHash algorithm (KDD '18) and representing the
fastest weighted minhash algorithm for sparse data. Our experiments show
running times that scale better with $k$ and $\Vert x \Vert_0$ compared to ICWS
(ICDM '10) and BagMinhash, obtaining $10$x speedups in common use cases. Our
approach also gives rise to a technique for computing fully independent
locality-sensitive hash values for $(L, K)$-parameterized approximate near
neighbor search under weighted Jaccard similarity in optimal expected time
$O(LK + \Vert x \Vert_0)$, improving on prior work even in the case of
unweighted sets.
Related papers
- Mini-Batch Kernel $k$-means [4.604003661048267]
A single iteration of our algorithm takes $widetildeO(kb2)$ time, significantly faster than the $O(n2)$ time required by the full batch kernel $k$-means.
Experiments demonstrate that our algorithm consistently achieves a 10-100x speedup with minimal loss in quality.
arXiv Detail & Related papers (2024-10-08T10:59:14Z) - Faster Private Minimum Spanning Trees [11.72102598708538]
We present a new differentially private MST algorithm that matches the utility of existing in-place methods while running in time.
We present a data structure that allows us to sample a noisy minimum weight edge among at most $O(n2)$ cut edges in $O(sqrtn log n)$ time.
arXiv Detail & Related papers (2024-08-13T16:00:30Z) - A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $delta(x)$ of $x$ for each $xin P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Minwise-Independent Permutations with Insertion and Deletion of Features [0.07252027234425332]
Broder textitet. al.citepBroderCFM98 introduces the $mathrmminHash$ algorithm that computes a low-dimensional sketch of binary data.
We show a rigorous theoretical analysis of our algorithms and complement it with extensive experiments on several real-world datasets.
arXiv Detail & Related papers (2023-08-22T07:27:45Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $mathbbRd$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/epsilon)O(k)$, whereepsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Private estimation algorithms for stochastic block models and mixture
models [63.07482515700984]
General tools for designing efficient private estimation algorithms.
First efficient $(epsilon, delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z) - C-MinHash: Rigorously Reducing $K$ Permutations to Two [25.356048456005023]
Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data.
We propose bf Circulant MinHash (C-MinHash) and provide the surprising theoretical results that we just need textbftwo independent random permutations.
arXiv Detail & Related papers (2021-09-07T21:06:33Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $widetildeO(ndk)$ for all $k = O(sqrtd) cup Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $widetildeO(ndk)$ for all $k$, at the expense of an $O(sqrtlog k)$ factor in the recovery guarantee
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - On Efficient Low Distortion Ultrametric Embedding [18.227854382422112]
A widely-used method to preserve the underlying hierarchical structure of the data is to find an embedding of the data into a tree or an ultrametric.
In this paper, we provide a new algorithm which takes as input a set of points isometric in $mathbbRd2 (for universal constant $rho>1$) to output an ultrametric $Delta.
We show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
arXiv Detail & Related papers (2020-08-15T11:06:45Z) - Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that for both problems, for dimensions of $frac1lambdaepsilon$, one can obtain streaming algorithms with spacely smaller than $frac1lambdaepsilon$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.