Clust-Splitter - an Efficient Nonsmooth Optimization-Based Algorithm for Clustering Large Datasets
- URL: http://arxiv.org/abs/2505.04389v1
- Date: Wed, 07 May 2025 13:13:46 GMT
- Title: Clust-Splitter - an Efficient Nonsmooth Optimization-Based Algorithm for Clustering Large Datasets
- Authors: Jenni Lampainen, Kaisa Joki, Napsu Karmitsa, Marko M. Mäkelä
- Abstract summary: We introduce Clust-Splitter, an efficient algorithm based on nonsmooth optimization, to solve the minimum sum-of-squares clustering problem. We evaluate Clust-Splitter on real-world datasets characterized by both a large number of attributes and a large number of data points.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Clustering is a fundamental task in data mining and machine learning, particularly for analyzing large-scale data. In this paper, we introduce Clust-Splitter, an efficient algorithm based on nonsmooth optimization, designed to solve the minimum sum-of-squares clustering problem in very large datasets. The clustering task is approached through a sequence of three nonsmooth optimization problems: two auxiliary problems used to generate suitable starting points, followed by a main clustering formulation. To solve these problems effectively, the limited memory bundle method is combined with an incremental approach to develop the Clust-Splitter algorithm. We evaluate Clust-Splitter on real-world datasets characterized by both a large number of attributes and a large number of data points and compare its performance with several state-of-the-art large-scale clustering algorithms. Experimental results demonstrate the efficiency of the proposed method for clustering very large datasets, as well as the high quality of its solutions, which are on par with those of the best existing methods.
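For reference, the minimum sum-of-squares clustering (MSSC) problem that Clust-Splitter is designed to solve is usually stated as the following nonsmooth optimization problem; this is the standard formulation, and the notation below is ours rather than quoted from the paper.

```latex
% MSSC: place k centers x_1,...,x_k in R^n so as to minimize the sum of
% squared Euclidean distances from each data point a_i to its nearest center.
\min_{x_1,\dots,x_k \in \mathbb{R}^n} f(x_1,\dots,x_k),
\qquad
f(x_1,\dots,x_k) = \sum_{i=1}^{m} \min_{j=1,\dots,k} \lVert x_j - a_i \rVert^2 .
```

The inner minimum over the k centers is what makes f nonsmooth, which is why the abstract pairs this formulation with the limited memory bundle method rather than a smooth optimizer.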
Related papers
- Exact and Heuristic Algorithms for Constrained Biclustering [0.0]
Biclustering, also known as co-clustering or two-way clustering, simultaneously partitions the rows and columns of a data matrix to reveal submatrices with coherent patterns. We study constrained biclustering with pairwise constraints, namely must-link and cannot-link constraints, which specify whether objects should belong to the same or different biclusters.
arXiv Detail & Related papers (2025-08-07T15:29:22Z)
- Strong bounds for large-scale Minimum Sum-of-Squares Clustering [0.9831489366502302]
Minimum Sum-of-Squares Clustering (MSSC) is one of the most widely used clustering methods. MSSC aims to minimize the total squared Euclidean distance between data points and their corresponding cluster centroids. We introduce a novel method to validate MSSC solutions through optimality gaps.
arXiv Detail & Related papers (2025-02-12T13:40:00Z)
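As a side note to the entry above: the two ingredients it mentions, the MSSC objective value of a candidate solution and a relative optimality gap against a lower bound, can be sketched in a few lines. This is a generic illustration under our own naming, not code from the cited paper.

```python
import numpy as np


def mssc_objective(points: np.ndarray, centers: np.ndarray) -> float:
    """Total squared Euclidean distance from each point to its nearest center."""
    # Pairwise squared distances: shape (n_points, n_centers).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


def relative_gap(upper_bound: float, lower_bound: float) -> float:
    """Relative optimality gap; zero means the solution is provably optimal."""
    return (upper_bound - lower_bound) / max(upper_bound, 1e-12)


# Toy usage with random data and centers (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
C = X[rng.choice(100, size=3, replace=False)]
ub = mssc_objective(X, C)               # objective of a feasible clustering
print(ub, relative_gap(ub, 0.9 * ub))   # pretend a lower bound 10% below the incumbent
```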
- Scalable Co-Clustering for Large-Scale Data through Dynamic Partitioning and Hierarchical Merging [7.106620444966807]
Co-clustering simultaneously clusters rows and columns, revealing more fine-grained groups. Existing co-clustering methods suffer from poor scalability and cannot handle large-scale data. This paper presents a novel and scalable co-clustering method designed to uncover intricate patterns in high-dimensional, large-scale datasets.
arXiv Detail & Related papers (2024-10-09T04:47:22Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
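For background on the entry above: the classic greedy farthest-first heuristic for the plain (unweighted) k-center problem, a well-known 2-approximation, looks like the sketch below. It is meant only as intuition for the k-center component; the paper's weighted 3-approximation, which also accounts for uncertainty sampling, is not reproduced here.

```python
import numpy as np


def greedy_k_center(points: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Farthest-first traversal: pick k center indices so that the maximum
    point-to-nearest-center distance is at most twice the optimum."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(points)))]        # arbitrary first center
    # Distance from every point to its nearest chosen center so far.
    dist = np.linalg.norm(points - points[centers[0]], axis=1)
    for _ in range(1, k):
        nxt = int(dist.argmax())                       # farthest point becomes a center
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers


# Toy usage (illustrative only).
X = np.random.default_rng(1).normal(size=(200, 2))
print(greedy_k_center(X, k=5))
```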
- One-step Multi-view Clustering with Diverse Representation [47.41455937479201]
We propose a one-step multi-view clustering with diverse representation method, which incorporates multi-view learning and $k$-means into a unified framework.
We develop an efficient optimization algorithm with proven convergence to solve the resultant problem.
arXiv Detail & Related papers (2023-06-08T02:52:24Z)
- Research on Efficient Fuzzy Clustering Method Based on Local Fuzzy Granular balls [67.33923111887933]
In this paper, the data are fuzzily iterated using granular-balls, and the membership degree of each data point considers only the two granular-balls in which it is located.
The resulting set of fuzzy granular-balls can be processed with a wider range of methods across different data scenarios.
arXiv Detail & Related papers (2023-03-07T01:52:55Z)
- Late Fusion Multi-view Clustering via Global and Local Alignment Maximization [61.89218392703043]
Multi-view clustering (MVC) optimally integrates complementary information from different views to improve clustering performance.
Most existing approaches directly fuse multiple pre-specified similarities to learn an optimal similarity matrix for clustering.
We propose late fusion MVC via alignment to address these issues.
arXiv Detail & Related papers (2022-08-02T01:49:31Z)
- A sampling-based approach for efficient clustering in large datasets [0.8952229340927184]
We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters.
Our contribution is substantially more efficient than k-means as it does not require an all-to-all comparison of data points and clusters.
arXiv Detail & Related papers (2021-12-29T19:15:20Z)
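A generic way to avoid the all-to-all comparison mentioned in the entry above is to fit cluster centers on a random subsample and then assign every point to its nearest center in a single pass. The sketch below illustrates that idea with scikit-learn's KMeans; it is our own simplification, not the algorithm proposed in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans


def sample_then_assign(points: np.ndarray, k: int, sample_size: int = 1000, seed: int = 0):
    """Cluster a large dataset by running k-means on a subsample,
    then assigning every point to its nearest learned center."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=min(sample_size, len(points)), replace=False)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points[idx])
    labels = km.predict(points)   # one pass over the full dataset
    return km.cluster_centers_, labels


# Toy usage (illustrative only).
X = np.random.default_rng(2).normal(size=(50_000, 10))
centers, labels = sample_then_assign(X, k=25)
```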
- LSEC: Large-scale spectral ensemble clustering [8.545202841051582]
We propose a large-scale spectral ensemble clustering (LSEC) method to strike a good balance between efficiency and effectiveness.
The LSEC method achieves a lower computational complexity than most existing ensemble clustering methods.
arXiv Detail & Related papers (2021-06-18T00:42:03Z)
- Multi-View Spectral Clustering with High-Order Optimal Neighborhood Laplacian Matrix [57.11971786407279]
Multi-view spectral clustering can effectively reveal the intrinsic cluster structure among data.
This paper proposes a multi-view spectral clustering algorithm that learns a high-order optimal neighborhood Laplacian matrix.
Our proposed algorithm generates the optimal Laplacian matrix by searching the neighborhood of the linear combination of both the first-order and high-order base Laplacian matrices.
arXiv Detail & Related papers (2020-08-31T12:28:40Z)
- Bi-objective Optimization of Biclustering with Binary Data [0.0]
Clustering consists of partitioning data objects into subsets called clusters according to some similarity criteria.
This paper addresses a quasi-clustering approach that allows clusters to overlap, which we link to biclustering.
Biclustering simultaneously groups the objects and features so that a specific group of objects has a special group of features.
arXiv Detail & Related papers (2020-02-09T21:49:26Z)
- Clustering Binary Data by Application of Combinatorial Optimization Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
From a set of 16 data tables generated by a quasi-Monte Carlo experiment, a comparison is performed for one of the aggregation criteria under the L1 dissimilarity, against hierarchical clustering and a version of k-means: partitioning around medoids (PAM).
arXiv Detail & Related papers (2020-01-06T23:33:31Z)
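As a small aside on the L1 dissimilarity used in the comparison above: for 0/1 data, the L1 distance between two rows equals their Hamming distance (the number of coordinates on which they differ). A minimal sketch with illustrative names, not taken from the cited paper:

```python
import numpy as np


def l1_dissimilarity_matrix(binary_data: np.ndarray) -> np.ndarray:
    """Pairwise L1 distances between rows of a 0/1 matrix.
    For binary vectors this equals the Hamming distance."""
    diff = np.abs(binary_data[:, None, :] - binary_data[None, :, :])
    return diff.sum(axis=2)


# Toy usage (illustrative only).
B = np.array([[1, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 1, 1]])
print(l1_dissimilarity_matrix(B))
# [[0 2 1]
#  [2 0 3]
#  [1 3 0]]
```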