Related papers: Detection and Evaluation of Clusters within Sequential Data

Detection and Evaluation of Clusters within Sequential Data

URL: http://arxiv.org/abs/2210.01679v1
Date: Tue, 4 Oct 2022 15:22:39 GMT
Title: Detection and Evaluation of Clusters within Sequential Data
Authors: Alexander Van Werde, Albert Senen-Cerda, Gianluca Kosmella, Jaron Sanders
Abstract summary: Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees. In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets. It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
Score: 58.720142291102135
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Motivated by theoretical advancements in dimensionality reduction techniques we use a recent model, called Block Markov Chains, to conduct a practical study of clustering in real-world sequential data. Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees and can be deployed in sparse data regimes. Despite these favorable theoretical properties, a thorough evaluation of these algorithms in realistic settings has been lacking. We address this issue and investigate the suitability of these clustering algorithms in exploratory data analysis of real-world sequential data. In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets. In order to evaluate the determined clusters, and the associated Block Markov Chain model, we further develop a set of evaluation tools. These tools include benchmarking, spectral noise analysis and statistical model selection tools. An efficient implementation of the clustering algorithm and the new evaluation tools is made available together with this paper. Practical challenges associated to real-world data are encountered and discussed. It is ultimately found that the Block Markov Chain model assumption, together with the tools developed here, can indeed produce meaningful insights in exploratory data analyses despite the complexity and sparsity of real-world data.

Related papers

Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [62.71425232332837]
We show that training amortized models with noisy labels is inexpensive and surprisingly effective. This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
arXiv Detail & Related papers (2024-01-29T03:42:37Z)
Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means [18.3248037914529]
We present an efficient and automatic clustering technique by integrating the principles of model-based and centroid-based methodologies. Statistical guarantees on the upper bound of clustering error suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
arXiv Detail & Related papers (2023-11-26T19:01:15Z)
Generating Multidimensional Clusters With Support Lines [0.0]
We present Clugen, a modular procedure for synthetic data generation. Cluken is open source, comprehensively unit tested and documented. We demonstrate that Clugen is fit for use in the assessment of clustering algorithms.
arXiv Detail & Related papers (2023-01-24T22:08:24Z)
Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach [0.0]
Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together. This article provides a deep description of the most widely used clustering methodologies. It emphasizes the comparison of these algorithms' clustering efficiency based on 3 datasets.
arXiv Detail & Related papers (2022-07-14T14:22:36Z)
An iterative clustering algorithm for the Contextual Stochastic Block Model with optimality guarantees [4.007017852999008]
We propose a new iterative algorithm to cluster networks with side information for nodes. We show that our algorithm is optimal under the Contextual Symmetric Block Model.
arXiv Detail & Related papers (2021-12-20T12:04:07Z)
Scalable Intervention Target Estimation in Linear Models [52.60799340056917]
Current approaches to causal structure learning either work with known intervention targets or use hypothesis testing to discover the unknown intervention targets. This paper proposes a scalable and efficient algorithm that consistently identifies all intervention targets. The proposed algorithm can be used to also update a given observational Markov equivalence class into the interventional Markov equivalence class.
arXiv Detail & Related papers (2021-11-15T03:16:56Z)
A review of systematic selection of clustering algorithms and their evaluation [0.0]
This paper aims to identify a systematic selection logic for clustering algorithms and corresponding validation concepts. The goal is to enable potential users to choose an algorithm that fits best to their needs and the properties of their underlying data clustering problem.
arXiv Detail & Related papers (2021-06-24T07:01:46Z)
Network Classifiers Based on Social Learning [71.86764107527812]
We propose a new way of combining independently trained classifiers over space and time. The proposed architecture is able to improve prediction performance over time with unlabeled data. We show that this strategy results in consistent learning with high probability, and it yields a robust structure against poorly trained classifiers.
arXiv Detail & Related papers (2020-10-23T11:18:20Z)
Structured Graph Learning for Clustering and Semi-supervised Classification [74.35376212789132]
We propose a graph learning framework to preserve both the local and global structure of data. Our method uses the self-expressiveness of samples to capture the global structure and adaptive neighbor approach to respect the local structure. Our model is equivalent to a combination of kernel k-means and k-means methods under certain condition.
arXiv Detail & Related papers (2020-08-31T08:41:20Z)
A semi-supervised sparse K-Means algorithm [3.04585143845864]
An unsupervised sparse clustering method can be employed in order to detect the subgroup of features necessary for clustering. A semi-supervised method can use the labelled data to create constraints and enhance the clustering solution. We show that the algorithm maintains the high performance of other semi-supervised algorithms and in addition preserves the ability to identify informative from uninformative features.
arXiv Detail & Related papers (2020-03-16T02:05:23Z)
New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets. The improved algorithm is called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC, and is characterized by: a drastic reduction in memory usage; a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.