A column generation algorithm with dynamic constraint aggregation for minimum sum-of-squares clustering
- URL: http://arxiv.org/abs/2410.06187v1
- Date: Tue, 8 Oct 2024 16:51:28 GMT
- Title: A column generation algorithm with dynamic constraint aggregation for minimum sum-of-squares clustering
- Authors: Antonio M. Sudoso, Daniel Aloise,
- Abstract summary: The minimum sum-of-squares clustering problem (MSSC) refers to the problem of partitioning $n$ data points into $k$ clusters.
We propose an efficient algorithm for solving large-scale MSSC instances, which combines column generation (CG) with dynamic constraint aggregation (DCA)
- Score: 0.30693357740321775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The minimum sum-of-squares clustering problem (MSSC), also known as $k$-means clustering, refers to the problem of partitioning $n$ data points into $k$ clusters, with the objective of minimizing the total sum of squared Euclidean distances between each point and the center of its assigned cluster. We propose an efficient algorithm for solving large-scale MSSC instances, which combines column generation (CG) with dynamic constraint aggregation (DCA) to effectively reduce the number of constraints considered in the CG master problem. DCA was originally conceived to reduce degeneracy in set partitioning problems by utilizing an aggregated restricted master problem obtained from a partition of the set partitioning constraints into disjoint clusters. In this work, we explore the use of DCA within a CG algorithm for MSSC exact solution. Our method is fine-tuned by a series of ablation studies on DCA design choices, and is demonstrated to significantly outperform existing state-of-the-art exact approaches available in the literature.
Related papers
- Strong bounds for large-scale Minimum Sum-of-Squares Clustering [0.9831489366502302]
Minimum Sum-of-Squares Clustering (MSSC) is one of the most widely used clustering methods.
MSSC aims to minimize the total squared Euclidean distance between data points and their corresponding cluster centroids.
We introduce a novel method to validate MSSC solutions through optimality gaps.
arXiv Detail & Related papers (2025-02-12T13:40:00Z) - A Greedy Strategy for Graph Cut [95.2841574410968]
We propose a greedy strategy to solve the problem of Graph Cut, called GGC.
It starts from the state where each data sample is regarded as a cluster and dynamically merges the two clusters.
GGC has a nearly linear computational complexity with respect to the number of samples.
arXiv Detail & Related papers (2024-12-28T05:49:42Z) - Self-Supervised Graph Embedding Clustering [70.36328717683297]
K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z) - A3S: A General Active Clustering Method with Pairwise Constraints [66.74627463101837]
A3S features strategic active clustering adjustment on the initial cluster result, which is obtained by an adaptive clustering algorithm.
In extensive experiments across diverse real-world datasets, A3S achieves desired results with significantly fewer human queries.
arXiv Detail & Related papers (2024-07-14T13:37:03Z) - Memetic Differential Evolution Methods for Semi-Supervised Clustering [0.8681835475119588]
We propose an extension for semi-supervised Minimum Sum-of-Squares Clustering (MSSC) problems of MDEClust.
Our new framework, called S-MDEClust, represents the first memetic methodology designed to generate an optimal feasible solution.
arXiv Detail & Related papers (2024-03-07T08:37:36Z) - Gap-Free Clustering: Sensitivity and Robustness of SDP [6.996002801232415]
We study graph clustering in the Block Model (SBM) in the presence of both large clusters and small, unrecoverable clusters.
Previous convex relaxation approaches achieving exact recovery do not allow any small clusters of size $o(sqrtn)$, or require a size gap between the smallest recovered cluster and the largest non-recovered cluster.
We provide an algorithm based on semidefinite programming (SDP) which removes these requirements and provably recovers large clusters regardless of the remaining cluster sizes.
arXiv Detail & Related papers (2023-08-29T21:27:21Z) - Revisiting Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [69.15976031704687]
We propose IAC (Instance-Adaptive Clustering), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability.
IAC maintains an overall computational complexity of $ mathcalO(n, textpolylog(n) $, making it scalable and practical for large-scale problems.
arXiv Detail & Related papers (2023-06-18T08:46:06Z) - Global Optimization for Cardinality-constrained Minimum Sum-of-Squares
Clustering via Semidefinite Programming [1.3053649021965603]
The minimum sum-of-squares clustering (MSSC) has been recently extended to exploit prior knowledge on the cardinality of each cluster.
We propose a global optimization approach based on the branch-and-cut technique to solve the cardinality-constrained MSSC.
For the upper bound, instead, we present a local search procedure that exploits the solution of the SDP relaxation solved at each node.
arXiv Detail & Related papers (2022-09-19T10:19:06Z) - Gradient Based Clustering [72.15857783681658]
We propose a general approach for distance based clustering, using the gradient of the cost function that measures clustering quality.
The approach is an iterative two step procedure (alternating between cluster assignment and cluster center updates) and is applicable to a wide range of functions.
arXiv Detail & Related papers (2022-02-01T19:31:15Z) - Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model textitdoes not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z) - An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering [0.5801044612920815]
We present a new branch-and-bound algorithm for semi-supervised MSSC.
Background knowledge is incorporated as pairwise must-link and cannot-link constraints.
For the first time, the proposed global optimization algorithm efficiently manages to solve real-world instances up to 800 data points.
arXiv Detail & Related papers (2021-11-30T17:08:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.