New advances in enumerative biclustering algorithms with online
partitioning
- URL: http://arxiv.org/abs/2003.04726v1
- Date: Sat, 7 Mar 2020 14:54:26 GMT
- Title: New advances in enumerative biclustering algorithms with online
partitioning
- Authors: Rosana Veroneze and Fernando J. Von Zuben
- Abstract summary: This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
- Score: 80.22629846165306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper further extends RIn-Close_CVC, a biclustering algorithm capable of
performing an efficient, complete, correct and non-redundant enumeration of
maximal biclusters with constant values on columns in numerical datasets. By
avoiding a priori partitioning and itemization of the dataset, RIn-Close_CVC
implements an online partitioning, which is demonstrated here to lead to more
informative biclustering results. The improved algorithm, called
RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC, as formally
proved here, and is characterized by: a drastic reduction in memory usage; a
consistent gain in runtime; additional ability to handle datasets with missing
values; and additional ability to operate with attributes characterized by
distinct distributions or even mixed data types. The experimental results
include synthetic and real-world datasets used to perform scalability and
sensitivity analyses. As a practical case study, a parsimonious set of relevant
and interpretable mixed-attribute-type rules is obtained in the context of
supervised descriptive pattern mining.
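As a rough illustration of what "constant values on columns" can mean for numerical data, the sketch below assumes a tolerance-based definition: a set of rows and columns forms such a bicluster when, within every selected column, the selected rows differ by at most a user-chosen tolerance. The function name `is_cvc_bicluster`, the parameter `eps`, and the toy matrix are hypothetical and not taken from the paper. The code only checks a given candidate; it does not enumerate maximal biclusters and is not RIn-Close_CVC3 itself.

```python
from typing import Sequence


def is_cvc_bicluster(
    data: Sequence[Sequence[float]],
    rows: Sequence[int],
    cols: Sequence[int],
    eps: float,
) -> bool:
    """Check a candidate bicluster under an assumed tolerance-based definition.

    Returns True when, for every selected column, the values over the
    selected rows span a range (max - min) of at most eps.
    """
    if not rows or not cols:
        return False  # an empty selection is not treated as a bicluster here
    for j in cols:
        values = [data[i][j] for i in rows]
        if max(values) - min(values) > eps:
            return False
    return True


# Toy 4x3 matrix (hypothetical data, for illustration only).
toy = [
    [1.0, 5.0, 9.0],
    [1.1, 5.2, 2.0],
    [0.9, 4.9, 7.0],
    [4.0, 5.1, 7.1],
]

# Rows 0-2 are nearly constant on columns 0 and 1, but not on column 2.
ok = is_cvc_bicluster(toy, rows=[0, 1, 2], cols=[0, 1], eps=0.5)
too_wide = is_cvc_bicluster(toy, rows=[0, 1, 2], cols=[0, 1, 2], eps=0.5)
```

Under these assumptions, `ok` is true and `too_wide` is false, because column 2 varies widely across the selected rows. Applying the tolerance per column, on the raw values, is one way to read the abstract's point about avoiding an a priori partitioning of the dataset.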
Related papers
- Spectral Clustering of Categorical and Mixed-type Data via Extra Graph
Nodes [0.0]
This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm.
We propose adding extra nodes corresponding to the different categories the data may belong to and show that it leads to an interpretable clustering objective function.
We demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data.
arXiv Detail & Related papers (2024-03-08T20:49:49Z)
- Feature construction using explanations of individual predictions [0.0]
We propose a novel approach for reducing the search space based on aggregation of instance-based explanations of predictive models.
We empirically show that reducing the search to these groups significantly reduces the time of feature construction.
We show significant improvements in classification accuracy for several classifiers and demonstrate the feasibility of the proposed feature construction even for large datasets.
arXiv Detail & Related papers (2023-01-23T18:59:01Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- Dataset Complexity Assessment Based on Cumulative Maximum Scaled Area
Under Laplacian Spectrum [38.65823547986758]
It is meaningful to predict classification performance by assessing the complexity of datasets effectively before training DCNN models.
This paper proposes a novel method called cumulative maximum scaled Area Under Laplacian Spectrum (cmsAULS).
arXiv Detail & Related papers (2022-09-29T13:02:04Z)
- Random projections and Kernelised Leave One Cluster Out
Cross-Validation: Universal baselines and evaluation tools for supervised
machine learning for materials properties [10.962094053749093]
Leave one cluster out cross-validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials.
We present a thorough comparison between composition-based representations, and investigate how kernel approximation functions can be used to enhance LOCO-CV applications.
We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception.
arXiv Detail & Related papers (2022-06-17T15:39:39Z)
- Adaptive Attribute and Structure Subspace Clustering Network [49.040136530379094]
We propose a novel self-expressiveness-based subspace clustering network.
We first consider an auto-encoder to represent input data samples.
Then, we construct a mixed signed and symmetric structure matrix to capture the local geometric structure underlying data.
We perform self-expressiveness on the constructed attribute and structure matrices to learn their affinity graphs.
arXiv Detail & Related papers (2021-09-28T14:00:57Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature
Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Contrastive Clustering [57.71729650297379]
We propose Contrastive Clustering (CC) which explicitly performs the instance- and cluster-level contrastive learning.
In particular, CC achieves an NMI of 0.705 (0.431) on the CIFAR-10 (CIFAR-100) dataset, which is an up to 19% (39%) performance improvement compared with the best baseline.
arXiv Detail & Related papers (2020-09-21T08:54:40Z)
- SECODA: Segmentation- and Combination-Based Detection of Anomalies [0.0]
SECODA is an unsupervised non-parametric anomaly detection algorithm for datasets containing continuous and categorical attributes.
The algorithm has a low memory imprint and its runtime performance scales linearly with the size of the dataset.
An evaluation with simulated and real-life datasets shows that this algorithm is able to identify many different types of anomalies.
arXiv Detail & Related papers (2020-08-16T10:03:14Z)
- Unsupervised Heterogeneous Coupling Learning for Categorical
Representation [50.1603042640492]
This work introduces a UNsupervised heTerogeneous couplIng lEarning (UNTIE) approach for representing coupled categorical data by untying the interactions between couplings.
UNTIE is efficiently optimized w.r.t. a kernel k-means objective function for unsupervised representation learning of heterogeneous and hierarchical value-to-object couplings.
The UNTIE-learned representations achieve significant performance improvements over state-of-the-art categorical representations and deep representation models.
arXiv Detail & Related papers (2020-07-21T11:23:27Z)