Computing Approximate $\ell_p$ Sensitivities
- URL: http://arxiv.org/abs/2311.04158v2
- Date: Tue, 21 Nov 2023 14:55:52 GMT
- Title: Computing Approximate $\ell_p$ Sensitivities
- Authors: Swati Padmanabhan, David P. Woodruff, and Qiuyi Zhang
- Abstract summary: We provide efficient algorithms for approximating $\ell_p$ sensitivities and summary statistics of a given matrix.
For a wide class of matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works in dimensionality reduction for regression tasks have introduced
the notion of sensitivity, an estimate of the importance of a specific
datapoint in a dataset, offering provable guarantees on the quality of the
approximation after removing low-sensitivity datapoints via subsampling.
However, fast algorithms for approximating $\ell_p$ sensitivities, which we
show is equivalent to approximate $\ell_p$ regression, are known for only the
$\ell_2$ setting, in which they are termed leverage scores.
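(For concreteness, the standard definition in this literature: the $\ell_p$ sensitivity of the $i$-th row of $A \in \mathbb{R}^{n \times d}$, and the total sensitivity, are
$$\sigma_i^p(A) = \sup_{x \,:\, Ax \neq 0} \frac{|(Ax)_i|^p}{\|Ax\|_p^p}, \qquad \mathfrak{S}^p(A) = \sum_{i=1}^{n} \sigma_i^p(A),$$
and for $p = 2$, $\sigma_i^2(A)$ is precisely the $i$-th leverage score of $A$.)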
In this work, we provide efficient algorithms for approximating $\ell_p$
sensitivities and related summary statistics of a given matrix. In particular,
for a given $n \times d$ matrix, we compute an $\alpha$-approximation to its
$\ell_1$ sensitivities at the cost of $O(n/\alpha)$ sensitivity computations.
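(A single "sensitivity computation" here is itself an $\ell_1$-regression problem, which is what makes the equivalence to approximate $\ell_p$ regression natural. As a minimal illustrative sketch, not the paper's algorithm: by homogeneity, $\sigma_i^1(A) = 1 / \min\{\|Ax\|_1 : (Ax)_i = 1\}$, so one exact $\ell_1$ sensitivity is a linear program. The helper below is a hypothetical NumPy/SciPy implementation of that LP.)

```python
import numpy as np
from scipy.optimize import linprog

def l1_sensitivity(A, i):
    """Exact l1 sensitivity of row i: sigma_i = max_x |(Ax)_i| / ||Ax||_1.
    By homogeneity, sigma_i = 1 / min{ ||Ax||_1 : (Ax)_i = 1 }, a linear
    program over x in R^d and slack variables t in R^n with |Ax| <= t."""
    n, d = A.shape
    c = np.concatenate([np.zeros(d), np.ones(n)])        # minimize sum(t) = ||Ax||_1
    I = np.eye(n)
    A_ub = np.block([[A, -I], [-A, -I]])                 # Ax - t <= 0 and -Ax - t <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.concatenate([A[i], np.zeros(n)])[None, :]  # normalization (Ax)_i = 1
    b_eq = np.array([1.0])
    bounds = [(None, None)] * d + [(0.0, None)] * n      # x free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return 1.0 / res.fun
```

Computing all $n$ sensitivities this way takes $n$ LP solves; the point of the claim above is that an $\alpha$-approximation to all of them needs only $O(n/\alpha)$ such computations.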
For estimating the total $\ell_p$ sensitivity (i.e. the sum of $\ell_p$
sensitivities), we provide an algorithm based on importance sampling of
$\ell_p$ Lewis weights, which computes a constant factor approximation to the
total sensitivity at the cost of roughly $O(\sqrt{d})$ sensitivity
computations. Furthermore, we estimate the maximum $\ell_1$ sensitivity, up to
a $\sqrt{d}$ factor, using $O(d)$ sensitivity computations. We generalize all
these results to $\ell_p$ norms for $p > 1$. Lastly, we experimentally show
that for a wide class of matrices in real-world datasets, the total sensitivity
can be quickly approximated and is significantly smaller than the theoretical
prediction, demonstrating that real-world datasets have low intrinsic effective
dimensionality.
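As a hedged illustration of the importance-sampling idea above (an illustrative reading of the stated approach, not the authors' exact algorithm): compute approximate $\ell_1$ Lewis weights with the standard fixed-point iteration, sample rows proportionally to them, and average the importance-weighted sensitivities. The function names are hypothetical, and `l1_sensitivity` is the LP sketch from earlier.

```python
import numpy as np

def l1_lewis_weights(A, iters=30):
    """Approximate l1 Lewis weights via the standard fixed-point iteration
    w_i <- sqrt( a_i^T (A^T W^{-1} A)^{-1} a_i ), where W = diag(w)."""
    n, _ = A.shape
    w = np.ones(n)
    for _ in range(iters):
        M_inv = np.linalg.inv(A.T @ (A / w[:, None]))     # (A^T W^{-1} A)^{-1}
        w = np.sqrt(np.einsum("ij,jk,ik->i", A, M_inv, A))
    return w

def estimate_total_l1_sensitivity(A, num_samples=64, seed=0):
    """Unbiased importance-sampling estimate of S = sum_i sigma_i:
    sample rows with probability p_i proportional to w_i, then average
    sigma_i / p_i. Reuses the hypothetical l1_sensitivity LP sketch above."""
    rng = np.random.default_rng(seed)
    w = l1_lewis_weights(A)
    p = w / w.sum()
    idx = rng.choice(len(p), size=num_samples, p=p)
    return float(np.mean([l1_sensitivity(A, i) / p[i] for i in idx]))
```

Since $\mathbb{E}[\sigma_i / p_i] = \sum_i \sigma_i$, this estimator is unbiased for any proposal with $p_i > 0$; the paper's contribution is showing that Lewis-weight proposals concentrate well enough that roughly $O(\sqrt{d})$ samples suffice for a constant-factor approximation, which this sketch makes no attempt to prove.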
Related papers
- Turnstile $\ell_p$ leverage score sampling with applications
We develop a novel algorithm for sampling rows $a_i$ of a matrix $A \in \mathbb{R}^{n \times d}$, proportional to their $\ell_p$ norm, when $A$ is presented in a turnstile data stream.
Our algorithm not only returns the set of sampled row indices, it also returns slightly perturbed rows $\tilde{a}_i \approx a_i$, and approximates their sampling probabilities up to $\varepsilon$ relative error.
For logistic regression, our framework yields the first algorithm that achieves a …
arXiv Detail & Related papers (2024-06-01T07:33:41Z)
- Optimal bounds for $\ell_p$ sensitivity sampling via $\ell_2$ augmentation
We show that by augmenting the $\ell_p$ sensitivities with $\ell_2$ sensitivities, we obtain better bounds on the optimal linear $\tilde{O}(\varepsilon^{-2} \mu^2 d)$ sampling complexity.
We also obtain an $\tilde{O}(\varepsilon^{-2} \mu^2 d)$ sensitivity sampling bound for logistic regression.
arXiv Detail & Related papers (2024-06-01T07:03:40Z)
- Online Robust Mean Estimation
We study the problem of high-dimensional robust mean estimation in an online setting.
We prove two main results about online robust mean estimation in this model.
arXiv Detail & Related papers (2023-10-24T15:28:43Z)
- Data Structures for Density Estimation
Given $k$ candidate distributions $v_1, \ldots, v_k$ over a domain of size $n$ and a sublinear (in $n$) number of samples from an unknown distribution $p$, our main result is the first data structure that identifies a close $v_i$ in time sublinear in $k$.
We also give an improved version of the algorithm of Acharya et al. that reports $v_i$ in time linear in $k$.
arXiv Detail & Related papers (2023-06-20T06:13:56Z)
- Sharper Bounds for $\ell_p$ Sensitivity Sampling
We show the first bounds for sensitivity sampling for $\ell_p$ subspace embeddings for $p > 2$.
We also show that the root leverage score sampling algorithm achieves a bound of roughly $d$ for $1 \leq p < 2$, and that a combination of leverage score and sensitivity sampling achieves an improved bound of roughly $d^{2/p} \mathfrak{S}^{2 - 4/p}$ for $2 < p < \infty$.
arXiv Detail & Related papers (2023-06-01T14:27:28Z)
- Active Sampling for Linear Regression Beyond the $\ell_2$ Norm
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound of $O(d^{\max\{1, p/2\}} \log^2 n)$ for loss functions with at most degree $p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.