Introduction to Coresets: Approximated Mean
- URL: http://arxiv.org/abs/2111.03046v1
- Date: Thu, 4 Nov 2021 17:49:38 GMT
- Title: Introduction to Coresets: Approximated Mean
- Authors: Alaa Maalouf and Ibrahim Jubran and Dan Feldman
- Abstract summary: A \emph{strong coreset} for the mean queries of a set $P$ in $\mathbb{R}^d$ is a small weighted subset $C\subseteq P$.
A \emph{weak coreset} is (also) a small weighted subset $C$ of $P$, whose mean approximates the mean of $P$.
- Score: 29.520871474641485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A \emph{strong coreset} for the mean queries of a set $P$ in ${\mathbb{R}}^d$
is a small weighted subset $C\subseteq P$, which provably approximates its sum
of squared distances to any center (point) $x\in {\mathbb{R}}^d$. A \emph{weak
coreset} is (also) a small weighted subset $C$ of $P$, whose mean approximates
the mean of $P$. While the mean of $P$ can be computed exactly in linear time,
its coreset can be used to solve harder constrained versions of the problem, and
is at the heart of generalizations such as coresets for $k$-means clustering.
In this paper, we survey most of the mean coreset construction techniques, and
suggest a unified analysis methodology for providing and explaining classical
and modern results including step-by-step proofs. In particular, we collected
folklore and scattered related results, some of which are not formally stated
elsewhere. Throughout this survey, we present, explain, and prove a set of
techniques, reductions, and algorithms that are widespread and crucial in this
field; when applied to the (relatively simple) mean problem, such techniques are
much simpler to grasp. The survey may help guide new researchers
unfamiliar with the field, and introduce them to the very basic foundations of
coresets, through a simple, yet fundamental, problem. Experts in this area
might appreciate the unified analysis flow, and the comparison table for
existing results. Finally, to encourage and help practitioners and software
engineers, we provide full open source code for all presented algorithms.
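To make the two definitions concrete, below is a minimal sketch of the classical uniform-sampling construction (sample $m$ points uniformly and weight each by $n/m$), one of the basic techniques a mean-coreset survey covers. This is an illustration in Python/NumPy, not the authors' released code; the data, the coreset size $m$, and the query center are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(P, w, x):
    # Weighted sum of squared distances from the rows of P to a center x.
    return float(np.sum(w * np.sum((P - x) ** 2, axis=1)))

def uniform_coreset(P, m, rng):
    # Sample m points uniformly at random and weight each by n/m, so that
    # weighted sums over the coreset are unbiased estimates of sums over P.
    n = len(P)
    idx = rng.choice(n, size=m, replace=True)
    return P[idx], np.full(m, n / m)

# Illustrative sizes (assumptions, not from the paper).
n, d, m = 100_000, 5, 500
P = rng.normal(size=(n, d))
ones = np.ones(n)

C, w = uniform_coreset(P, m, rng)

# Weak coreset property: the weighted mean of C approximates the mean of P.
mu_P = P.mean(axis=0)
mu_C = np.average(C, axis=0, weights=w)
print("mean error:", np.linalg.norm(mu_P - mu_C))

# Strong coreset property: for a query center x, the weighted cost on C
# approximates the sum of squared distances of P to x.
x = rng.normal(size=d)
print("relative cost error:",
      abs(cost(C, w, x) - cost(P, ones, x)) / cost(P, ones, x))

# The classical decomposition behind mean coresets:
#   sum_p ||p - x||^2 = sum_p ||p - mu||^2 + n * ||mu - x||^2,
# i.e. the mean and total variance of P already determine the cost of every
# center x; verify it numerically (the gap is only floating-point error).
lhs = cost(P, ones, x)
rhs = cost(P, ones, mu_P) + n * float(np.sum((mu_P - x) ** 2))
print("identity gap:", abs(lhs - rhs))
```

As $m$ grows, both printed errors shrink, and the final check verifies the exact identity that makes the mean problem so amenable to small sketches.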
Related papers
- Approximate Algorithms For $k$-Sparse Wasserstein Barycenter With Outliers [10.259254824702555]
We study the $k$-sparse Wasserstein Barycenter problem in the presence of outliers.
Existing WB algorithms cannot be directly extended to handle the case with outliers.
We propose a clustering-based LP method that yields a constant approximation factor for the $k$-sparse WB with outliers problem.
arXiv Detail & Related papers (2024-04-20T15:01:35Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Manifold Free Riemannian Optimization [4.484251538832438]
A principled framework for solving optimization problems over a smooth manifold $\mathcal{M}$ is proposed.
We use a noiseless sample set of the cost function $(x_i, y_i)\in \mathcal{M}\times\mathbb{R}$ and the intrinsic dimension of the manifold $\mathcal{M}$.
arXiv Detail & Related papers (2022-09-07T16:19:06Z) - An Empirical Evaluation of $k$-Means Coresets [4.45709593827781]
There is no work comparing the quality of available $k$-means coresets.
We propose a benchmark for which we argue that computing coresets is challenging.
We conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
arXiv Detail & Related papers (2022-07-03T06:47:53Z) - FriendlyCore: Practical Differentially Private Aggregation [67.04951703461657]
We propose a simple and practical tool $\mathsf{FriendlyCore}$ that takes a set of points ${\cal D}$ from an unrestricted (pseudo) metric space as input.
When ${\cal D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a "stable" subset ${\cal D}_G\subseteq {\cal D}$ that includes all points.
$\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy.
arXiv Detail & Related papers (2021-10-19T17:43:50Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - Average-case Complexity of Teaching Convex Polytopes via Halfspace
Queries [55.28642461328172]
We show that the average-case teaching complexity is $\Theta(d)$, which is in sharp contrast to the worst-case teaching complexity of $\Theta(n)$.
Our insights allow us to establish a tight bound on the average-case complexity for $\phi$-separable dichotomies.
arXiv Detail & Related papers (2020-06-25T19:59:24Z) - List-Decodable Mean Estimation via Iterative Multi-Filtering [44.805549762166926]
We are given a set $T$ of points in $\mathbb{R}^d$ with the promise that an unknown $\alpha$-fraction of the points in $T$ are drawn from a distribution $D$ with unknown mean and bounded covariance.
We output a small list of hypothesis vectors such that at least one of them is close to the mean of $D$.
In more detail, our algorithm is sample and computationally efficient, and achieves information-theoretically near-optimal error.
arXiv Detail & Related papers (2020-06-18T17:47:37Z) - Faster PAC Learning and Smaller Coresets via Smoothed Analysis [25.358415142404752]
PAC-learning usually aims to compute a small subset ($\varepsilon$-sample/net) from $n$ items.
Inspired by smoothed analysis, we suggest a natural generalization: approximate the \emph{average} (instead of the worst-case) error over the queries.
arXiv Detail & Related papers (2020-06-09T18:25:34Z)