Optimization of Inter-group Criteria for Clustering with Minimum Size Constraints
- URL: http://arxiv.org/abs/2401.07091v1
- Date: Sat, 13 Jan 2024 14:59:12 GMT
- Title: Optimization of Inter-group Criteria for Clustering with Minimum Size Constraints
- Authors: Eduardo S. Laber and Lucas Murtinho
- Abstract summary: Internal measures that are used to assess the quality of a clustering usually take into account intra-group and/or inter-group criteria.
We propose algorithms with provable approximation guarantees for optimizing inter-group criteria.
We present an empirical study with 10 real datasets, providing evidence that our methods work very well in practical settings.
- Score: 2.7195102129095003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Internal measures that are used to assess the quality of a clustering usually
take into account intra-group and/or inter-group criteria. There are many
papers in the literature that propose algorithms with provable approximation
guarantees for optimizing the former. However, the optimization of inter-group
criteria is much less understood.
Here, we contribute to the state-of-the-art of this literature by devising
algorithms with provable guarantees for the maximization of two natural
inter-group criteria, namely the minimum spacing and the minimum spanning tree
spacing. The former is the minimum distance between points in different groups
while the latter captures separability through the cost of the minimum spanning
tree that connects all groups. We obtain results for both the unrestricted
case, in which no constraint on the clusters is imposed, and for the
constrained case where each group is required to have a minimum number of
points. Our constraint is motivated by the fact that the popular Single
Linkage, which optimizes both criteria in the unrestricted case, produces
clusterings with many tiny groups.
To complement our work, we present an empirical study with 10 real datasets,
providing evidence that our methods work very well in practical settings.
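The two inter-group criteria from the abstract are straightforward to state computationally. As a minimal sketch of the definitions only (not the paper's approximation algorithms): minimum spacing is the smallest distance between any two points in different groups, and minimum spanning tree spacing treats each group as a super-node, weights each pair of groups by their smallest cross-group distance, and sums a minimum spanning tree over those super-nodes (Prim's algorithm here).

```python
from itertools import combinations
from math import dist

def min_spacing(clusters):
    """Minimum spacing: smallest distance between points in different groups."""
    return min(
        dist(p, q)
        for A, B in combinations(clusters, 2)
        for p in A for q in B
    )

def mst_spacing(clusters):
    """MST spacing: cost of a minimum spanning tree over the groups,
    where the weight between two groups is their minimum inter-group
    distance. Computed with Prim's algorithm over k super-nodes."""
    k = len(clusters)
    # Pairwise group-to-group weights (brute force over point pairs).
    d = [[min(dist(p, q) for p in clusters[i] for q in clusters[j])
          if i != j else 0.0
          for j in range(k)] for i in range(k)]
    in_tree = {0}
    cost = 0.0
    while len(in_tree) < k:
        # Cheapest edge leaving the current tree.
        w, j = min((d[i][j], j)
                   for i in in_tree for j in range(k) if j not in in_tree)
        cost += w
        in_tree.add(j)
    return cost
```

For example, with three groups at roughly x = 0, 5, and 9 on a line, the minimum spacing is the 4-unit gap between the two closest groups, while the MST spacing also counts the edge needed to attach the third group. Both computations are brute force here; the point is the definition, not efficiency.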
Related papers
- Fair Clustering for Data Summarization: Improved Approximation Algorithms and Complexity Insights [16.120911591795295]
While in some applications all data points can be chosen as centers, in the general setting centers must be chosen from a subset of the points, referred to as facilities or suppliers.
In this work, we focus on fair data summarization modeled as the fair $k$-supplier problem, where data consists of several groups, and a minimum number of centers must be selected from each group.
arXiv Detail & Related papers (2024-10-16T18:00:19Z) - Fair Minimum Representation Clustering via Integer Programming [0.6906005491572401]
Clustering is an unsupervised learning task that aims to partition data into a set of clusters.
In this paper, we study the k-means and k-medians clustering problems with the additional constraint that each group must have a minimum level of representation.
We present an alternating minimization algorithm, called MiniReL, that directly incorporates the fairness constraints.
arXiv Detail & Related papers (2024-09-04T00:13:40Z) - Optimal Multi-Distribution Learning [88.3008613028333]
Multi-distribution learning seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions.
We propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$.
arXiv Detail & Related papers (2023-12-08T16:06:29Z) - Fair Minimum Representation Clustering [0.0]
Clustering is an unsupervised learning task that aims to partition data into a set of clusters.
We show that the popular $k$-means algorithm, Lloyd's algorithm, can result in unfair outcomes.
We present a variant of Lloyd's algorithm, called MiniReL, that directly incorporates the fairness constraints.
arXiv Detail & Related papers (2023-02-06T23:16:38Z) - An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering [0.5801044612920815]
We present a new branch-and-bound algorithm for semi-supervised MSSC.
Background knowledge is incorporated as pairwise must-link and cannot-link constraints.
For the first time, the proposed global optimization algorithm efficiently solves real-world instances with up to 800 data points.
arXiv Detail & Related papers (2021-11-30T17:08:53Z) - Max-Min Grouped Bandits [48.62520520818357]
We introduce a multi-armed bandit problem termed max-min grouped bandits.
The goal is to find a group whose worst arm has the highest mean reward.
This problem is of interest to applications such as recommendation systems.
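The max-min selection objective in this blurb (not the bandit algorithm itself, which must estimate arm means from samples) can be sketched in a few lines; the function name and input layout are illustrative assumptions:

```python
def best_group(group_means):
    """Max-min grouped bandit objective: return the index of the group
    whose worst arm has the highest mean reward.

    group_means: list of groups, each a list of per-arm mean rewards.
    """
    return max(range(len(group_means)), key=lambda g: min(group_means[g]))
```

For instance, given groups `[[0.9, 0.1], [0.5, 0.4]]`, the second group is selected: its worst arm (0.4) beats the first group's worst arm (0.1), even though the first group contains the single best arm.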
arXiv Detail & Related papers (2021-11-17T01:59:15Z) - Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z) - Fairness, Semi-Supervised Learning, and More: A General Framework for Clustering with Stochastic Pairwise Constraints [32.19047459493177]
We introduce a novel family of stochastic pairwise constraints, which we incorporate into several essential clustering objectives.
We show that these constraints can succinctly model an intriguing collection of applications, including Individual Fairness in clustering and Must-link constraints in semi-supervised learning.
arXiv Detail & Related papers (2021-03-02T20:27:58Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\|_\infty \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Robust Optimization for Fairness with Noisy Protected Groups [85.13255550021495]
We study the consequences of naively relying on noisy protected group labels.
We introduce two new approaches using robust optimization.
We show that the robust approaches achieve better true group fairness guarantees than the naive approach.
arXiv Detail & Related papers (2020-02-21T14:58:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.