Optimization of Inter-group Criteria for Clustering with Minimum Size Constraints
- URL: http://arxiv.org/abs/2401.07091v1
- Date: Sat, 13 Jan 2024 14:59:12 GMT
- Title: Optimization of Inter-group Criteria for Clustering with Minimum Size Constraints
- Authors: Eduardo S. Laber and Lucas Murtinho
- Abstract summary: Internal measures that are used to assess the quality of a clustering usually take into account intra-group and/or inter-group criteria.
We propose algorithms with provable approximation guarantees for optimizing inter-group criteria.
We present an empirical study with 10 real datasets, providing evidence that our methods work very well in practical settings.
- Score: 2.7195102129095003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Internal measures that are used to assess the quality of a clustering usually
take into account intra-group and/or inter-group criteria. There are many
papers in the literature that propose algorithms with provable approximation
guarantees for optimizing the former. However, the optimization of inter-group
criteria is much less understood.
Here, we contribute to the state-of-the-art of this literature by devising
algorithms with provable guarantees for the maximization of two natural
inter-group criteria, namely the minimum spacing and the minimum spanning tree
spacing. The former is the minimum distance between points in different groups
while the latter captures separability through the cost of the minimum spanning
tree that connects all groups. We obtain results for both the unrestricted
case, in which no constraint on the clusters is imposed, and for the
constrained case where each group is required to have a minimum number of
points. Our constraint is motivated by the fact that the popular Single
Linkage, which optimizes both criteria in the unrestricted case, produces
clusterings with many tiny groups.
To complement our work, we present an empirical study with 10 real datasets,
providing evidence that our methods work very well in practical settings.
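The two inter-group criteria from the abstract are straightforward to state computationally. As a minimal sketch of the definitions only (not the paper's approximation algorithms): minimum spacing is the smallest distance between any two points in different groups, and minimum spanning tree spacing treats each group as a super-node, weights each pair of groups by their smallest cross-group distance, and sums a minimum spanning tree over those super-nodes (Prim's algorithm here).

```python
from itertools import combinations
from math import dist

def min_spacing(clusters):
    """Minimum spacing: smallest distance between points in different groups."""
    return min(
        dist(p, q)
        for A, B in combinations(clusters, 2)
        for p in A for q in B
    )

def mst_spacing(clusters):
    """MST spacing: cost of a minimum spanning tree over the groups,
    where the weight between two groups is their minimum inter-group
    distance. Computed with Prim's algorithm over k super-nodes."""
    k = len(clusters)
    # Pairwise group-to-group weights (brute force over point pairs).
    d = [[min(dist(p, q) for p in clusters[i] for q in clusters[j])
          if i != j else 0.0
          for j in range(k)] for i in range(k)]
    in_tree = {0}
    cost = 0.0
    while len(in_tree) < k:
        # Cheapest edge leaving the current tree.
        w, j = min((d[i][j], j)
                   for i in in_tree for j in range(k) if j not in in_tree)
        cost += w
        in_tree.add(j)
    return cost
```

For example, with three groups at roughly x = 0, 5, and 9 on a line, the minimum spacing is the 4-unit gap between the two closest groups, while the MST spacing also counts the edge needed to attach the third group. Both computations are brute force here; the point is the definition, not efficiency.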
Related papers
- Fair Clustering for Data Summarization: Improved Approximation Algorithms and Complexity Insights [16.120911591795295]
While in some applications all data points can be chosen as centers, in the general setting centers must be chosen from a subset of the points, referred to as facilities or suppliers.
In this work, we focus on fair data summarization modeled as the fair $k$-supplier problem, where data consists of several groups, and a minimum number of centers must be selected from each group.
arXiv Detail & Related papers (2024-10-16T18:00:19Z) - Fair Minimum Representation Clustering via Integer Programming [0.6906005491572401]
Clustering is an unsupervised learning task that aims to partition data into a set of clusters.
In this paper, we study the k-means and k-medians clustering problems with the additional constraint that each group must have a minimum level of representation.
We present an alternating minimization algorithm, called MiniReL, that directly incorporates the fairness constraints.
arXiv Detail & Related papers (2024-09-04T00:13:40Z) - Optimal Multi-Distribution Learning [88.3008613028333]
Multi-distribution learning seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions.
We propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$.
arXiv Detail & Related papers (2023-12-08T16:06:29Z) - Fair Minimum Representation Clustering [0.0]
Clustering is an unsupervised learning task that aims to partition data into a set of clusters.
We show that the popular $k$-means algorithm, Lloyd's algorithm, can result in unfair outcomes.
We present a variant of Lloyd's algorithm, called MiniReL, that directly incorporates the fairness constraints.
arXiv Detail & Related papers (2023-02-06T23:16:38Z) - An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering [0.5801044612920815]
We present a new branch-and-bound algorithm for semi-supervised MSSC.
Background knowledge is incorporated as pairwise must-link and cannot-link constraints.
For the first time, the proposed global optimization algorithm efficiently solves real-world instances with up to 800 data points.
arXiv Detail & Related papers (2021-11-30T17:08:53Z) - Max-Min Grouped Bandits [48.62520520818357]
We introduce a multi-armed bandit problem termed max-min grouped bandits.
The goal is to find a group whose worst arm has the highest mean reward.
This problem is of interest to applications such as recommendation systems.
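The max-min selection objective in this blurb (not the bandit algorithm itself, which must estimate arm means from samples) can be sketched in a few lines; the function name and input layout are illustrative assumptions:

```python
def best_group(group_means):
    """Max-min grouped bandit objective: return the index of the group
    whose worst arm has the highest mean reward.

    group_means: list of groups, each a list of per-arm mean rewards.
    """
    return max(range(len(group_means)), key=lambda g: min(group_means[g]))
```

For instance, given groups `[[0.9, 0.1], [0.5, 0.4]]`, the second group is selected: its worst arm (0.4) beats the first group's worst arm (0.1), even though the first group contains the single best arm.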
arXiv Detail & Related papers (2021-11-17T01:59:15Z) - Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z) - Fairness, Semi-Supervised Learning, and More: A General Framework for Clustering with Stochastic Pairwise Constraints [32.19047459493177]
We introduce a novel family of stochastic pairwise constraints, which we incorporate into several essential clustering objectives.
We show that these constraints can succinctly model an intriguing collection of applications, including Individual Fairness in clustering and Must-link constraints in semi-supervised learning.
arXiv Detail & Related papers (2021-03-02T20:27:58Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\|_\infty \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Robust Optimization for Fairness with Noisy Protected Groups [85.13255550021495]
We study the consequences of naively relying on noisy protected group labels.
We introduce two new approaches using robust optimization.
We show that the robust approaches achieve better true group fairness guarantees than the naive approach.
arXiv Detail & Related papers (2020-02-21T14:58:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.