Generating Multidimensional Clusters With Support Lines
- URL: http://arxiv.org/abs/2301.10327v3
- Date: Mon, 4 Mar 2024 20:46:37 GMT
- Title: Generating Multidimensional Clusters With Support Lines
- Authors: Nuno Fachada, Diogo de Andrade
- Abstract summary: We present Clugen, a modular procedure for synthetic data generation.
Clugen is open source, comprehensively unit tested and documented.
We demonstrate that Clugen is fit for use in the assessment of clustering algorithms.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Synthetic data is essential for assessing clustering techniques,
complementing and extending real data, and allowing for more complete coverage
of a given problem's space. In turn, synthetic data generators have the
potential of creating vast amounts of data -- a crucial activity when
real-world data is at a premium -- while providing a well-understood generation
procedure and an interpretable instrument for methodically investigating
cluster analysis algorithms. Here, we present Clugen, a modular procedure for
synthetic data generation, capable of creating multidimensional clusters
supported by line segments using arbitrary distributions. Clugen is open
source, comprehensively unit tested and documented, and is available for the
Python, R, Julia, and MATLAB/Octave ecosystems. We demonstrate that our
proposal can produce rich and varied results in various dimensions, is fit for
use in the assessment of clustering algorithms, and has the potential to be a
widely used framework in diverse clustering-related research tasks.
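
The idea of a cluster "supported by a line segment" can be illustrated with a minimal sketch. It is not Clugen's actual implementation or API. The function name `line_supported_cluster`, its parameters, and the choice of distributions are assumptions made for illustration: a uniform distribution along the line and Gaussian noise perpendicular to it. Each point is placed on a segment through a center and then pushed away from the line by a lateral offset.

```python
import math
import random


def line_supported_cluster(center, direction, length, lateral_std, n_points, rng):
    """Sample points around a line segment (hypothetical helper, not Clugen's API)."""
    # Normalize the direction so that positions along the line are in length units.
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    points = []
    for _ in range(n_points):
        # Assumption: positions along the segment are uniform in [-length/2, length/2].
        t = rng.uniform(-length / 2, length / 2)
        on_line = [c + t * u for c, u in zip(center, unit)]
        # Assumption: lateral spread is Gaussian. Draw isotropic noise, then remove
        # its component along the line so that the offset is perpendicular to it.
        noise = [rng.gauss(0.0, lateral_std) for _ in center]
        along = sum(n * u for n, u in zip(noise, unit))
        offset = [n - along * u for n, u in zip(noise, unit)]
        points.append([p + o for p, o in zip(on_line, offset)])
    return points


# Usage: one 3D cluster of 200 points around a segment of length 10.
rng = random.Random(42)
cluster = line_supported_cluster([0.0, 0.0, 0.0], [1.0, 1.0, 0.0], 10.0, 0.5, 200, rng)
```

Swapping `rng.uniform` or `rng.gauss` for other samplers changes the shape of the cluster, which is the sense in which the abstract speaks of arbitrary distributions.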
Related papers
- ClusterGraph: a new tool for visualization and compression of multidimensional data [0.0]
This paper provides an additional layer on the output of any clustering algorithm.
It provides information about the global layout of clusters, obtained from the considered clustering algorithm.
arXiv Detail & Related papers (2024-11-08T09:40:54Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Deep Clustering: A Comprehensive Survey [53.387957674512585]
Clustering analysis plays an indispensable role in machine learning and data mining.
Deep clustering, which can learn clustering-friendly representations using deep neural networks, has been broadly applied in a wide range of clustering tasks.
Existing surveys for deep clustering mainly focus on the single-view fields and the network architectures, ignoring the complex application scenarios of clustering.
arXiv Detail & Related papers (2022-10-09T02:31:32Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- Enhancing cluster analysis via topological manifold learning [0.3823356975862006]
We show that inferring the topological structure of a dataset before clustering can considerably enhance cluster detection.
We combine manifold learning method UMAP for inferring the topological structure with density-based clustering method DBSCAN.
arXiv Detail & Related papers (2022-07-01T15:53:39Z)
- A Multiscale Environment for Learning by Diffusion [9.619814126465206]
We introduce the Multiscale Environment for Learning by Diffusion (MELD) data model.
We show that the MELD data model precisely captures latent multiscale structure in data and facilitates its analysis.
To efficiently learn the multiscale structure observed in many real datasets, we introduce the Multiscale Learning by Unsupervised Diffusion (M-LUND) clustering algorithm.
arXiv Detail & Related papers (2021-01-31T17:46:19Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Kernel learning approaches for summarising and combining posterior similarity matrices [68.8204255655161]
We build upon the notion of the posterior similarity matrix (PSM) in order to suggest new approaches for summarising the output of MCMC algorithms for Bayesian clustering models.
A key contribution of our work is the observation that PSMs are positive semi-definite, and hence can be used to define probabilistically-motivated kernel matrices.
arXiv Detail & Related papers (2020-09-27T14:16:14Z)
- Elastic Coupled Co-clustering for Single-Cell Genomic Data [0.0]
Single-cell technologies have enabled us to profile genomic features at unprecedented resolution.
Data integration can potentially lead to a better performance of clustering algorithms.
In this work, we formulate the problem in an unsupervised transfer learning framework.
arXiv Detail & Related papers (2020-03-29T08:21:53Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm is called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC, and is characterized by: a drastic reduction in memory usage; a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.