Sampling a Near Neighbor in High Dimensions -- Who is the Fairest of Them All?
- URL: http://arxiv.org/abs/2101.10905v1
- Date: Tue, 26 Jan 2021 16:13:07 GMT
- Title: Sampling a Near Neighbor in High Dimensions -- Who is the Fairest of Them All?
- Authors: Martin Aumüller, Sariel Har-Peled, Sepideh Mahabadi, Rasmus Pagh,
Francesco Silvestri
- Abstract summary: We study the $r$-NN problem in the light of individual fairness and providing equal opportunities.
All points that are within distance $r$ from the query should have the same probability of being returned.
We propose several efficient data structures for the exact and approximate variants of the fair NN problem.
- Score: 17.514226416388475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Similarity search is a fundamental algorithmic primitive, widely used in many
computer science disciplines. Given a set of points $S$ and a radius parameter
$r>0$, the $r$-near neighbor ($r$-NN) problem asks for a data structure that,
given any query point $q$, returns a point $p$ within distance at most $r$ from
$q$. In this paper, we study the $r$-NN problem in the light of individual
fairness and providing equal opportunities: all points that are within distance
$r$ from the query should have the same probability of being returned. In the
low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS
2014). Locality sensitive hashing (LSH), the theoretically strongest approach
to similarity search in high dimensions, does not provide such a fairness
guarantee. In this work, we show that LSH-based algorithms can be made fair,
without a significant loss in efficiency. We propose several efficient data
structures for the exact and approximate variants of the fair NN problem. Our
approach works more generally for sampling uniformly from a sub-collection of
sets of a given collection and can be used in a few other applications. We also
develop a data structure for fair similarity search under inner product that
requires nearly-linear space and exploits locality sensitive filters. The paper
concludes with an experimental evaluation that highlights the inherent
unfairness of NN data structures and shows the performance of our algorithms on
real-world datasets.
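As a rough illustration of the core idea, here is a minimal sketch of sampling uniformly from a union of LSH buckets via rejection sampling; the `tables`, `hash_fns`, and `max_tries` names are illustrative, and the paper's actual data structures achieve this with stronger efficiency guarantees than this simple loop.

```python
import random

def fair_near_neighbor(q, tables, hash_fns, dist, r, max_tries=100):
    """Sample a uniformly random point among the r-near neighbors of q
    found in q's LSH buckets (a sketch, not the paper's data structure)."""
    # collect q's bucket from every hash table
    buckets = [t.get(h(q), []) for t, h in zip(tables, hash_fns)]
    sizes = [len(b) for b in buckets]
    if sum(sizes) == 0:
        return None
    for _ in range(max_tries):
        # pick a bucket proportionally to its size, then a uniform element of it;
        # a point p in the union is thus drawn with probability deg(p) / total
        i = random.choices(range(len(buckets)), weights=sizes)[0]
        p = random.choice(buckets[i])
        # accept with probability 1/deg(p), which cancels the bias toward points
        # colliding with q in many tables; keep p only if it is a true r-near
        # neighbor of q
        deg = sum(p in b for b in buckets)
        if random.random() < 1.0 / deg and dist(p, q) <= r:
            return p
    return None  # no accepted near neighbor within the trial budget
```

The degree-based rejection is what restores fairness: a plain LSH query is biased toward points that collide with $q$ in many tables, which is exactly the inherent unfairness the experimental evaluation highlights.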
Related papers
- Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
We revisit the input perturbations framework for differential privacy, where noise is added to the input $A \in \mathcal{S}$.
We first design novel efficient algorithms to privately release pair-wise cosine similarities.
We derive a novel algorithm to compute $k$-way marginal queries over $n$ features.
arXiv Detail & Related papers (2024-06-07T12:07:16Z)
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is called individually fair if it has a center within distance $\delta(x)$ of each point $x \in P$; a sketch of this condition follows this entry.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z)
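For context, a minimal NumPy sketch of the individual-fairness condition above, under the standard definition from this line of work (Jung et al.) that $\delta(x)$ is the smallest radius around $x$ whose ball contains at least $\lceil n/k \rceil$ points; this only checks fairness and is not the paper's scalable algorithm.

```python
import numpy as np

def fairness_radii(P, k):
    """delta(x): smallest radius around x capturing at least ceil(n/k) points
    (the standard individual-fairness radius; an assumption here, since the
    summary above does not spell out the definition)."""
    n = len(P)
    m = int(np.ceil(n / k))
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)  # pairwise distances
    D.sort(axis=1)             # row i = sorted distances from point i
    return D[:, m - 1]         # m-th smallest; x itself counts as the 1st

def is_individually_fair(P, centers, k):
    """True iff every point has a center within its fairness radius delta(x)."""
    nearest = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=-1).min(axis=1)
    return bool(np.all(nearest <= fairness_radii(P, k)))
```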
- A Combinatorial Approach to Robust PCA [18.740048806623037]
We study the problem of recovering Gaussian data under adversarial corruptions.
We assume that the Gaussian noise lies in an unknown $k$-dimensional subspace $U \subseteq \mathbb{R}^d$, and $s$ randomly chosen coordinates of each data point fall under the control of an adversary.
Our main result is an efficient algorithm that, when $ks^2 = O(d)$, recovers every single data point up to a nearly optimal error of $\tilde{O}(ks/d)$ in expectation.
arXiv Detail & Related papers (2023-11-28T01:49:51Z)
- Scalable Differentially Private Clustering via Hierarchically Separated Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot \mathrm{OPT} + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z)
- Systematically improving existing k-means initialization algorithms at nearly no cost, by pairwise-nearest-neighbor smoothing [1.2570180539670577]
We present a meta-method, called PNN-smoothing, for initializing the $k$-means clustering algorithm.
It consists of splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor method; a sketch follows this entry.
arXiv Detail & Related papers (2022-02-08T15:56:30Z)
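A minimal sketch of the PNN-smoothing recipe described above, with illustrative parameter names; the merge step uses the classical pairwise-nearest-neighbor (Ward-style) merge cost, and details such as how each subset clustering is initialized are simplified away.

```python
import numpy as np
from sklearn.cluster import KMeans

def pnn_smoothing_init(X, k, J=10, seed=0):
    """Cluster J random subsets of X separately, then merge the J*k centers
    down to k by repeatedly joining the pairwise-nearest pair of clusters.
    Assumes len(X) >= J * k. Returns k centers to use as a k-means init."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), J)   # J random subsets
    centers = np.vstack([
        KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X[p]).cluster_centers_
        for p in parts
    ])
    w = np.ones(len(centers))            # weight = number of clusters merged so far
    while len(centers) > k:
        # Ward-style merge cost for clusters i, j: w_i*w_j/(w_i+w_j) * ||c_i - c_j||^2
        d2 = ((centers[:, None] - centers[None, :]) ** 2).sum(-1)
        cost = w[:, None] * w[None, :] / (w[:, None] + w[None, :]) * d2
        np.fill_diagonal(cost, np.inf)
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        merged = (w[i] * centers[i] + w[j] * centers[j]) / (w[i] + w[j])
        keep = [t for t in range(len(centers)) if t not in (i, j)]
        centers = np.vstack([centers[keep], merged])
        w = np.append(w[keep], w[i] + w[j])
    return centers  # e.g. KMeans(n_clusters=k, init=centers, n_init=1).fit(X)
```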
- Under-bagging Nearest Neighbors for Imbalanced Classification [63.026765294759876]
We propose an ensemble learning algorithm called under-bagging $k$-NN for imbalanced classification problems; a sketch follows this entry.
arXiv Detail & Related papers (2021-09-01T14:10:38Z)
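A minimal sketch of the under-bagging idea for binary imbalanced data, with illustrative parameter names; each bag keeps all minority points plus an equal-size random subsample of the majority class, and the bags' $k$-NN predictions are majority-voted (the paper's exact procedure and analysis may differ).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def under_bagging_knn_predict(X, y, X_test, n_bags=10, k=5, seed=0):
    """Ensemble of k-NN classifiers, each trained on a balanced bag:
    all minority points + a same-size subsample of the majority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)
    votes = np.zeros(len(X_test))
    for _ in range(n_bags):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)  # under-sample
        idx = np.concatenate([min_idx, sub])
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[idx], y[idx])
        votes += clf.predict(X_test) == minority
    # predict the minority class wherever a majority of the bags vote for it
    return np.where(votes > n_bags / 2, minority, majority)
```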
- Learning to Hash Robustly, with Guarantees [79.68057056103014]
In this paper, we design an NNS algorithm for the Hamming space with worst-case guarantees essentially matching those of theoretical algorithms.
We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically.
Our algorithm achieves 1.8x and 2.1x better recall on the worst-performing queries of the MNIST and ImageNet datasets, respectively.
arXiv Detail & Related papers (2021-08-11T20:21:30Z)
- Better Algorithms for Individually Fair $k$-Clustering [4.936817539835045]
Jung, Kannan, and Lutz [FORC 2020] introduced a clustering algorithm with provable (approximate) fairness and objective guarantees for the $\ell_\infty$ or $k$-Center objective. Mahabadi and Vakilian [ICML 2020] revisited this problem to give a local-search algorithm for all $\ell_p$-norms.
We prove that by modifying known LP rounding techniques, one gets a worst-case guarantee on the objective that is much better than in MV20 and, empirically, extremely close to optimal.
arXiv Detail & Related papers (2021-06-23T04:16:46Z)
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective (recalled after this entry) is a popular generalization of the well-known $k$-means problem.
We show that by making a few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
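For reference, the soft objective referred to above is the textbook fuzzy $c$-means relaxation of $k$-means (recalled here for context, not quoted from the paper):

```latex
\min_{c_1,\dots,c_k;\; U}\ \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{\,m}\, \lVert x_i - c_j \rVert^2
\qquad \text{s.t.} \quad \sum_{j=1}^{k} u_{ij} = 1, \quad u_{ij} \ge 0,
```

where the fuzzifier $m > 1$ controls the softness of the memberships $u_{ij}$; as $m \to 1$ the memberships become hard assignments and the objective reduces to $k$-means.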
- Sample Complexity of Adversarially Robust Linear Classification on Separated Data [41.30425613353595]
We consider the sample complexity of learning with adversarial robustness.
We show that for very well-separated data, convergence rates of $O(\frac{1}{n})$ are achievable.
arXiv Detail & Related papers (2020-12-19T22:04:59Z)