Related papers: DartMinHash: Fast Sketching for Weighted Sets

DartMinHash: Fast Sketching for Weighted Sets

URL: http://arxiv.org/abs/2005.11547v1
Date: Sat, 23 May 2020 14:59:25 GMT
Title: DartMinHash: Fast Sketching for Weighted Sets
Authors: Tobias Christiani
Abstract summary: Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines. We introduce a simple algorithm that takes a weighted set $x in mathbbR_geq 0d$ and computes $k$ independent minhashes in expected time.
Score: 0.5584060970507505
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines. We introduce a simple algorithm that takes a weighted set $x \in \mathbb{R}_{\geq 0}^{d}$ and computes $k$ independent minhashes in expected time $O(k \log k + \Vert x \Vert_{0}\log( \Vert x \Vert_1 + 1/\Vert x \Vert_1))$, improving upon the state-of-the-art BagMinHash algorithm (KDD '18) and representing the fastest weighted minhash algorithm for sparse data. Our experiments show running times that scale better with $k$ and $\Vert x \Vert_0$ compared to ICWS (ICDM '10) and BagMinhash, obtaining $10$x speedups in common use cases. Our approach also gives rise to a technique for computing fully independent locality-sensitive hash values for $(L, K)$-parameterized approximate near neighbor search under weighted Jaccard similarity in optimal expected time $O(LK + \Vert x \Vert_0)$, improving on prior work even in the case of unweighted sets.

Related papers

Mini-Batch Kernel $k$-means [4.604003661048267]
A single iteration of our algorithm takes $widetildeO(kb2)$ time, significantly faster than the $O(n2)$ time required by the full batch kernel $k$-means. Experiments demonstrate that our algorithm consistently achieves a 10-100x speedup with minimal loss in quality.
arXiv Detail & Related papers (2024-10-08T10:59:14Z)
Faster Private Minimum Spanning Trees [11.72102598708538]
We present a new differentially private MST algorithm that matches the utility of existing in-place methods while running in time. We present a data structure that allows us to sample a noisy minimum weight edge among at most $O(n2)$ cut edges in $O(sqrtn log n)$ time.
arXiv Detail & Related papers (2024-08-13T16:00:30Z)
A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al. A clustering is then called individually fair if it has centers within distance $delta(x)$ of $x$ for each $xin P$. We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z)
Minwise-Independent Permutations with Insertion and Deletion of Features [0.07252027234425332]
Broder textitet. al.citepBroderCFM98 introduces the $mathrmminHash$ algorithm that computes a low-dimensional sketch of binary data. We show a rigorous theoretical analysis of our algorithms and complement it with extensive experiments on several real-world datasets.
arXiv Detail & Related papers (2023-08-22T07:27:45Z)
Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets. We present an improved version of the "$q$-means" algorithm for clustering. We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z)
Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $mathbbRd$ with respect to the square loss. Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/epsilon)O(k)$, whereepsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z)
Private estimation algorithms for stochastic block models and mixture models [63.07482515700984]
General tools for designing efficient private estimation algorithms. First efficient $(epsilon, delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z)
C-MinHash: Rigorously Reducing $K$ Permutations to Two [25.356048456005023]
Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data. We propose bf Circulant MinHash (C-MinHash) and provide the surprising theoretical results that we just need textbftwo independent random permutations.
arXiv Detail & Related papers (2021-09-07T21:06:33Z)
Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset. We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z)
List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions. Our algorithm runs in time $widetildeO(ndk)$ for all $k = O(sqrtd) cup Omega(d)$, where $n$ is the size of the dataset. A variant of our algorithm has runtime $widetildeO(ndk)$ for all $k$, at the expense of an $O(sqrtlog k)$ factor in the recovery guarantee
arXiv Detail & Related papers (2020-11-19T17:21:37Z)
On Efficient Low Distortion Ultrametric Embedding [18.227854382422112]
A widely-used method to preserve the underlying hierarchical structure of the data is to find an embedding of the data into a tree or an ultrametric. In this paper, we provide a new algorithm which takes as input a set of points isometric in $mathbbRd2 (for universal constant $rho>1$) to output an ultrametric $Delta. We show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
arXiv Detail & Related papers (2020-08-15T11:06:45Z)
Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model. We show that for both problems, for dimensions of $frac1lambdaepsilon$, one can obtain streaming algorithms with spacely smaller than $frac1lambdaepsilon$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.