PCCC: The Pairwise-Confidence-Constraints-Clustering Algorithm
- URL: http://arxiv.org/abs/2212.14437v1
- Date: Thu, 29 Dec 2022 19:21:33 GMT
- Title: PCCC: The Pairwise-Confidence-Constraints-Clustering Algorithm
- Authors: Philipp Baumann and Dorit S. Hochbaum
- Abstract summary: We introduce the PCCC algorithm, which iteratively assigns objects to clusters while accounting for the information provided on the pairs of objects.
Our algorithm can include relationships as hard constraints that are guaranteed to be satisfied or as soft constraints that can be violated.
Unlike existing algorithms, our algorithm scales to large-scale instances with up to 60,000 objects, 100 clusters, and millions of cannot-link constraints.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider a semi-supervised $k$-clustering problem where information is
available on whether pairs of objects are in the same or in different clusters.
This information is either available with certainty or with a limited level of
confidence. We introduce the PCCC algorithm, which iteratively assigns objects
to clusters while accounting for the information provided on the pairs of
objects. Our algorithm can include relationships as hard constraints that are
guaranteed to be satisfied or as soft constraints that can be violated subject
to a penalty. This flexibility distinguishes our algorithm from the
state-of-the-art in which all pairwise constraints are either considered hard,
or all are considered soft. Unlike existing algorithms, our algorithm scales to
large-scale instances with up to 60,000 objects, 100 clusters, and millions of
cannot-link constraints (which are the most challenging constraints to
incorporate). We compare the PCCC algorithm with state-of-the-art approaches in
an extensive computational study. Even though the PCCC algorithm is more
general than the state-of-the-art approaches in its applicability, it
outperforms the state-of-the-art approaches on instances with all hard
constraints or all soft constraints both in terms of running time and various
metrics of solution quality. The source code of the PCCC algorithm is publicly
available on GitHub.
Related papers
- From Large to Small Datasets: Size Generalization for Clustering
Algorithm Selection [12.993073967843292]
We study a problem in a semi-supervised setting, with an unknown ground-truth clustering.
We introduce a notion of size generalization for clustering algorithm accuracy.
We use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
arXiv Detail & Related papers (2024-02-22T06:53:35Z) - Near-Optimal Algorithms for Constrained k-Center Clustering with Instance-level Background Knowledge [12.808663917871888]
We build on widely adopted $k$-center clustering and model its input background knowledge as must-link (ML) and cannot-link (CL) constraint sets.
We arrive at the first efficient approximation algorithm for constrained $k$-center with the best possible ratio of 2.
arXiv Detail & Related papers (2024-01-23T07:16:32Z) - Differentially-Private Hierarchical Clustering with Provable
Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: that any $epsilon$-DP algorithm must exhibit $O(|V|2/ epsilon)$-additive error for an input dataset.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z) - On Leave-One-Out Conditional Mutual Information For Generalization [122.2734338600665]
We derive information theoretic generalization bounds for supervised learning algorithms based on a new measure of leave-one-out conditional mutual information (loo-CMI)
Contrary to other CMI bounds, our loo-CMI bounds can be computed easily and can be interpreted in connection to other notions such as classical leave-one-out cross-validation.
We empirically validate the quality of the bound by evaluating its predicted generalization gap in scenarios for deep learning.
arXiv Detail & Related papers (2022-07-01T17:58:29Z) - An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering [0.5801044612920815]
We present a new branch-and-bound algorithm for semi-supervised MSSC.
Background knowledge is incorporated as pairwise must-link and cannot-link constraints.
For the first time, the proposed global optimization algorithm efficiently manages to solve real-world instances up to 800 data points.
arXiv Detail & Related papers (2021-11-30T17:08:53Z) - A black-box adversarial attack for poisoning clustering [78.19784577498031]
We propose a black-box adversarial attack for crafting adversarial samples to test the robustness of clustering algorithms.
We show that our attacks are transferable even against supervised algorithms such as SVMs, random forests, and neural networks.
arXiv Detail & Related papers (2020-09-09T18:19:31Z) - Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z) - An Integer Linear Programming Framework for Mining Constraints from Data [81.60135973848125]
We present a general framework for mining constraints from data.
In particular, we consider the inference in structured output prediction as an integer linear programming (ILP) problem.
We show that our approach can learn to solve 9x9 Sudoku puzzles and minimal spanning tree problems from examples without providing the underlying rules.
arXiv Detail & Related papers (2020-06-18T20:09:53Z) - Beyond Worst-Case Analysis in Stochastic Approximation: Moment
Estimation Improves Instance Complexity [58.70807593332932]
We study oracle complexity of gradient based methods for approximation problems.
We focus on instance-dependent complexity instead of worst case complexity.
Our proposed algorithm and its analysis provide a theoretical justification for the success of moment estimation.
arXiv Detail & Related papers (2020-06-08T09:25:47Z) - A semi-supervised sparse K-Means algorithm [3.04585143845864]
An unsupervised sparse clustering method can be employed in order to detect the subgroup of features necessary for clustering.
A semi-supervised method can use the labelled data to create constraints and enhance the clustering solution.
We show that the algorithm maintains the high performance of other semi-supervised algorithms and in addition preserves the ability to identify informative from uninformative features.
arXiv Detail & Related papers (2020-03-16T02:05:23Z) - Differentially Private k-Means Clustering with Guaranteed Convergence [5.335316436366718]
Iterative clustering algorithms help us to learn the insights behind the data.
It may allow adversaries to infer the privacy of individuals with some background knowledge.
To protect individual privacy against such an inference attack, preserving differential privacy (DP) for the iterative clustering algorithms has been extensively studied.
arXiv Detail & Related papers (2020-02-03T22:53:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.