Discovery data topology with the closure structure. Theoretical and
practical aspects
- URL: http://arxiv.org/abs/2010.02628v3
- Date: Tue, 30 Mar 2021 08:30:16 GMT
- Title: Discovery data topology with the closure structure. Theoretical and
practical aspects
- Authors: Tatiana Makhalova, Aleksey Buzmakov, Sergei O. Kuznetsov and Amedeo
Napoli
- Abstract summary: We introduce a concise representation -- the closure structure -- based on closed itemsets and their minimum generators.
We propose a formalization of the closure structure in terms of Formal Concept Analysis.
We present and demonstrate theoretical results, as well as practical results obtained with the GDPM algorithm.
- Score: 21.70710923045654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we revisit pattern mining, and especially itemset
mining, which allows one to analyze binary datasets in search of
interesting and meaningful association rules and respective itemsets in an
unsupervised way. While a summarization of a dataset based on a set of patterns
does not provide a general and satisfying view over a dataset, we introduce a
concise representation -- the closure structure -- based on closed itemsets and
their minimum generators, for capturing the intrinsic content of a dataset. The
closure structure allows one to understand the topology of the dataset as a
whole and the inherent complexity of the data. We propose a formalization of
the closure structure in terms of Formal Concept Analysis, which is well
adapted to study this data topology. We present and demonstrate theoretical
results, as well as practical results obtained with the GDPM algorithm. GDPM is
rather unique in its functionality as it returns a characterization of the
topology of a dataset in terms of complexity levels, highlighting the diversity
and the distribution of the itemsets. Finally, a series of experiments shows
how GDPM can be practically used and what can be expected from the output.
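
The notions of closed itemset and minimum generator used in the abstract can be illustrated with a minimal brute-force sketch. This is not the GDPM algorithm: the function names `closure` and `closure_levels`, the toy dataset, and the choice to ignore itemsets that occur in no transaction are assumptions made only for illustration. The sketch computes each closed itemset and the size of its smallest generator, which is taken here as the "level" of that closed itemset.

```python
from itertools import combinations


def closure(itemset, transactions):
    """Items common to every transaction containing `itemset` (None if unsupported)."""
    itemset = frozenset(itemset)
    covering = [frozenset(t) for t in transactions if itemset <= frozenset(t)]
    if not covering:
        return None  # simplification of this sketch: ignore itemsets absent from the data
    return frozenset.intersection(*covering)


def closure_levels(transactions):
    """Map each closed itemset to the size of its smallest generator (brute force)."""
    items = sorted(set().union(*transactions))
    levels = {}
    # Candidates are visited by increasing size, so the first generator found
    # for a closed itemset has minimum cardinality.
    for k in range(len(items) + 1):
        for candidate in combinations(items, k):
            closed = closure(candidate, transactions)
            if closed is not None:
                levels.setdefault(closed, k)
    return levels


# Hypothetical toy dataset: each transaction is a set of items.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"d"}]
for closed, level in sorted(closure_levels(toy).items(),
                            key=lambda kv: (kv[1], sorted(kv[0]))):
    print(level, sorted(closed))
```

On this toy dataset, the closed itemset {a, b, c} sits at level 2 because no single item generates it, while {a, b} sits at level 1 because the single item b already has {a, b} as its closure. The enumeration is exponential in the number of items, so it is only suitable for very small examples.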
Related papers
- Exploiting Formal Concept Analysis for Data Modeling in Data Lakes [0.29998889086656577]
This paper introduces a practical data visualization and analysis approach rooted in Formal Concept Analysis (FCA).
We represent data structures as objects, analyze the concept lattice, and present two strategies, top-down and bottom-up, to unify these structures and establish a common schema.
We achieve a complete coverage of 80 percent of data structures with only 34 distinct field names.
arXiv Detail & Related papers (2024-08-11T13:58:31Z)
- Tree-based variational inference for Poisson log-normal models [47.82745603191512]
Hierarchical trees are often used to organize entities based on proximity criteria.
Current count-data models do not leverage this structured information.
We introduce the PLN-Tree model as an extension of the PLN model for modeling hierarchical count data.
arXiv Detail & Related papers (2024-06-25T08:24:35Z)
- Topological Quality of Subsets via Persistence Matching Diagrams [0.196629787330046]
We measure the quality of a subset concerning the dataset it represents using topological data analysis techniques.
In particular, this approach enables us to explain why the chosen subset is likely to result in poor performance of a supervised learning model.
arXiv Detail & Related papers (2023-06-04T17:08:41Z)
- Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
arXiv Detail & Related papers (2023-05-24T11:05:12Z)
- Unsupervised hierarchical clustering using the learning dynamics of RBMs [0.0]
We present a new and general method for building relational data trees by exploiting the learning dynamics of the Restricted Boltzmann Machine (RBM).
Our method is based on the mean-field approach, derived from the Plefka expansion, and developed in the context of disordered systems.
We tested our method on an artificially hierarchical dataset and on three different real-world datasets (images of digits, mutations in the human genome, and a family of proteins).
arXiv Detail & Related papers (2023-02-03T16:53:32Z)
- Topological Learning in Multi-Class Data Sets [0.3050152425444477]
We study the impact of topological complexity on learning in feedforward deep neural networks (DNNs).
We evaluate our topological classification algorithm on multiple constructed and open source data sets.
arXiv Detail & Related papers (2023-01-23T21:54:25Z)
- Feature construction using explanations of individual predictions [0.0]
We propose a novel approach for reducing the search space based on aggregation of instance-based explanations of predictive models.
We empirically show that reducing the search to these groups significantly reduces the time of feature construction.
We show significant improvements in classification accuracy for several classifiers and demonstrate the feasibility of the proposed feature construction even for large datasets.
arXiv Detail & Related papers (2023-01-23T18:59:01Z)
- Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z)
- Structural Learning of Probabilistic Sentential Decision Diagrams under Partial Closed-World Assumption [127.439030701253]
Probabilistic sentential decision diagrams are a class of structured-decomposable circuits.
We propose a new scheme based on a partial closed-world assumption: data implicitly provide the logical base of the circuit.
Preliminary experiments show that the proposed approach might properly fit training data, and generalize well to test data, provided that these remain consistent with the underlying logical base.
arXiv Detail & Related papers (2021-07-26T12:01:56Z)
- CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation methods.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)