Discovery data topology with the closure structure. Theoretical and
practical aspects
- URL: http://arxiv.org/abs/2010.02628v3
- Date: Tue, 30 Mar 2021 08:30:16 GMT
- Title: Discovery data topology with the closure structure. Theoretical and
practical aspects
- Authors: Tatiana Makhalova, Aleksey Buzmakov, Sergei O. Kuznetsov and Amedeo
Napoli
- Abstract summary: We introduce a concise representation -- the closure structure -- based on closed itemsets and their minimum generators.
We propose a formalization of the closure structure in terms of Formal Concept Analysis.
We present and demonstrate theoretical results, as well as practical results obtained with the GDPM algorithm.
- Score: 21.70710923045654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we revisit pattern mining and especially itemset
mining, which allows one to analyze binary datasets in search of interesting
and meaningful association rules and their respective itemsets in an
unsupervised way. Since a summarization of a dataset based on a set of patterns
does not provide a general and satisfying view of the dataset, we introduce a
concise representation -- the closure structure -- based on closed itemsets and
their minimum generators, for capturing the intrinsic content of a dataset. The
closure structure allows one to understand the topology of the dataset as a
whole, along with the inherent complexity of the data. We propose a
formalization of the closure structure in terms of Formal Concept Analysis,
which is well adapted to the study of this data topology. We present and
demonstrate theoretical results, as well as practical results obtained with the
GDPM algorithm. GDPM is rather unique in its functionality, as it returns a
characterization of the topology of a dataset in terms of complexity levels,
highlighting the diversity and the distribution of the itemsets. Finally, a
series of experiments shows how GDPM can be used in practice and what can be
expected from its output.
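To make the core notions concrete, the following sketch enumerates closed itemsets, their minimal generators, and the resulting complexity levels on a toy binary dataset. This is an illustration of the definitions only, not the GDPM algorithm; the transactions and item names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Toy binary dataset: each transaction is a set of items.
# The transactions are invented for this illustration.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]
items = sorted(set().union(*transactions))

def extent(itemset):
    """Indices of the transactions that contain every item of `itemset`."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)

def closure(itemset):
    """Closed itemset: all items shared by the transactions in extent(itemset)."""
    ext = extent(itemset)
    if not ext:
        return frozenset(items)  # empty extent: closure is the full item set
    common = set(items)
    for i in ext:
        common &= transactions[i]
    return frozenset(common)

# Group every itemset by its closure; the inclusion-minimal itemsets in each
# group are the minimal generators of that closed itemset.  The size of the
# smallest generator gives the level of the closed itemset in the closure
# structure.
generators = defaultdict(list)
for r in range(len(items) + 1):
    for combo in combinations(items, r):
        s = frozenset(combo)
        generators[closure(s)].append(s)

for closed, gens in generators.items():
    minimal = [g for g in gens if not any(h < g for h in gens)]
    level = min(len(g) for g in minimal)
    print(sorted(closed), "level", level, "generators", [sorted(g) for g in minimal])
```

On this toy dataset the closure of {a} is {a} itself (it appears in transactions 0, 1, and 2, which share only a), while {a, b} is already closed with itself as its only minimal generator.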
Related papers
- Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
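A minimal sketch of the recommended variant, assuming invented 2-D points (this is not the authors' implementation): starting from singleton clusters, the pair with the highest average pairwise dot product is merged at each step, instead of the pair with the lowest distance.

```python
# Hypothetical 2-D points, chosen so that {0, 1} and {2, 3} form two groups.
points = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def avg_dot(ca, cb):
    """Average pairwise dot product between two clusters (lists of point indices)."""
    return sum(dot(points[i], points[j]) for i in ca for j in cb) / (len(ca) * len(cb))

# Agglomerative loop: merge the pair of clusters with the HIGHEST average
# dot product, rather than the lowest distance or within-cluster variance.
clusters = [[i] for i in range(len(points))]
merges = []
while len(clusters) > 1:
    a, b = max(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: avg_dot(clusters[ij[0]], clusters[ij[1]]),
    )
    merges.append((clusters[a], clusters[b]))
    clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [clusters[a] + clusters[b]]

print(merges)  # the merge order defines the recovered tree
```

Here the two tight groups are merged first, and the root merge joins them, matching the intuition that the merge order encodes the hierarchical structure.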
arXiv Detail & Related papers (2023-05-24T11:05:12Z)
- Unsupervised hierarchical clustering using the learning dynamics of RBMs [0.0]
We present a new and general method for building relational data trees by exploiting the learning dynamics of the Restricted Boltzmann Machine (RBM).
Our method is based on the mean-field approach, derived from the Plefka expansion, and developed in the context of disordered systems.
We tested our method on an artificial hierarchical dataset and on three different real-world datasets (images of digits, mutations in the human genome, and a family of proteins).
arXiv Detail & Related papers (2023-02-03T16:53:32Z) - Topological Learning in Multi-Class Data Sets [0.3050152425444477]
We study the impact of topological complexity on learning in feedforward deep neural networks (DNNs).
We evaluate our topological classification algorithm on multiple constructed and open source data sets.
arXiv Detail & Related papers (2023-01-23T21:54:25Z) - Feature construction using explanations of individual predictions [0.0]
We propose a novel approach for reducing the search space based on aggregation of instance-based explanations of predictive models.
We empirically show that reducing the search to these groups significantly reduces the time of feature construction.
We show significant improvements in classification accuracy for several classifiers and demonstrate the feasibility of the proposed feature construction even for large datasets.
arXiv Detail & Related papers (2023-01-23T18:59:01Z) - Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z) - Bayesian Structure Learning with Generative Flow Networks [85.84396514570373]
In Bayesian structure learning, we are interested in inferring a distribution over directed acyclic graphs (DAGs) from data.
Recently, a class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling.
We show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs.
arXiv Detail & Related papers (2022-02-28T15:53:10Z) - Structural Learning of Probabilistic Sentential Decision Diagrams under
Partial Closed-World Assumption [127.439030701253]
Probabilistic sentential decision diagrams are a class of structured-decomposable circuits.
We propose a new scheme based on a partial closed-world assumption: data implicitly provide the logical base of the circuit.
Preliminary experiments show that the proposed approach might properly fit training data, and generalize well to test data, provided that these remain consistent with the underlying logical base.
arXiv Detail & Related papers (2021-07-26T12:01:56Z) - Clustering multivariate functional data using unsupervised binary trees [0.0]
We propose a model-based clustering algorithm for a general class of functional data.
The random functional data realizations could be measured with error at discrete, and possibly random, points in the domain of definition.
The new algorithm provides easily interpretable results and fast predictions for online data sets.
arXiv Detail & Related papers (2020-12-10T20:56:49Z) - CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural
Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation ways.
arXiv Detail & Related papers (2020-10-11T02:19:15Z) - Hierarchical regularization networks for sparsification based learning
on noisy datasets [0.0]
The hierarchy follows from approximation spaces identified at successively finer scales.
To promote model generalization at each scale, we also introduce a novel, projection-based penalty operator across multiple dimensions.
Results show the performance of the approach as a data reduction and modeling strategy on both synthetic and real datasets.
arXiv Detail & Related papers (2020-06-09T18:32:24Z) - New advances in enumerative biclustering algorithms with online
partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, retains the attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
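To make the "constant values on columns" (CVC) criterion from the biclustering entry above concrete, here is a small sketch, not the RIn-Close_CVC implementation, that checks whether a given submatrix qualifies as a CVC bicluster; the data matrix and tolerance are invented for the example.

```python
# Toy numerical dataset (rows x columns); values are invented.
data = [
    [1.0, 5.0, 2.0],
    [3.0, 5.0, 2.0],
    [1.0, 4.0, 2.0],
]

def is_cvc_bicluster(rows, cols, eps=0.0):
    """True if the submatrix data[rows][cols] has (near-)constant values
    on each column, within tolerance eps."""
    for j in cols:
        vals = [data[i][j] for i in rows]
        if max(vals) - min(vals) > eps:
            return False
    return True

print(is_cvc_bicluster([0, 1], [1, 2]))   # columns 1 and 2 are constant on rows 0, 1
print(is_cvc_bicluster([0, 1, 2], [1]))   # column 1 varies once row 2 is included
```

Enumerative algorithms such as RIn-Close_CVC search for maximal row/column sets satisfying this kind of per-column constancy check without redundancy; the sketch only verifies one candidate submatrix.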
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.