Tell Me Something I Don't Know: Randomization Strategies for Iterative
Data Mining
- URL: http://arxiv.org/abs/2006.09467v1
- Date: Tue, 16 Jun 2020 19:20:50 GMT
- Title: Tell Me Something I Don't Know: Randomization Strategies for Iterative
Data Mining
- Authors: Sami Hanhijärvi, Markus Ojala, Niko Vuokko, Kai Puolamäki, Nikolaj
Tatti, Heikki Mannila
- Abstract summary: We consider the problem of randomizing data so that previously discovered patterns or models are taken into account.
- Score: 0.6100370338020054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is a wide variety of data mining methods available, and it is generally
useful in exploratory data analysis to use many different methods for the same
dataset. This, however, leads to the problem of whether the results found by
one method are a reflection of the phenomenon shown by the results of another
method, or whether the results depict in some sense unrelated properties of the
data. For example, using clustering can give indication of a clear cluster
structure, and computing correlations between variables can show that there are
many significant correlations in the data. However, it can be the case that the
correlations are actually determined by the cluster structure.
In this paper, we consider the problem of randomizing data so that previously
discovered patterns or models are taken into account. The randomization methods
can be used in iterative data mining. At each step in the data mining process,
the randomization produces random samples from the set of data matrices
satisfying the already discovered patterns or models. That is, given a data set
and some statistics (e.g., cluster centers or co-occurrence counts) of the
data, the randomization methods sample data sets having similar values of the
given statistics as the original data set. We use Metropolis sampling based on
local swaps to achieve this. We describe experiments on real data that
demonstrate the usefulness of our approach. Our results indicate that in many
cases, the results of, e.g., clustering actually imply the results of, say,
frequent pattern discovery.
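The randomization procedure the abstract describes can be sketched as follows. This is a minimal, illustrative Python implementation of Metropolis sampling based on local swaps over 0-1 data matrices: each swap of a 2x2 "checkerboard" submatrix preserves row and column sums exactly, and a Metropolis acceptance step keeps a user-supplied statistic (standing in for already-discovered patterns, e.g. co-occurrence counts) close to its original value. The function names, the scalar `stat_fn` interface, and the Gaussian-style energy penalty are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def swap_randomize(X, stat_fn, n_steps=10000, w=10.0, rng=None):
    """Sample a randomized 0-1 matrix via Metropolis local swaps.

    Row and column sums of X are preserved exactly (every swap does);
    the scalar statistic stat_fn(X) is kept near its original value
    through a Metropolis acceptance test with penalty weight w.
    Illustrative sketch only, not the authors' reference code.
    """
    rng = np.random.default_rng(rng)
    X = X.copy()
    target = stat_fn(X)
    energy = 0.0  # w * |stat(X) - target|, zero at the start
    for _ in range(n_steps):
        # Pick two rows and two columns; swap only if they form a
        # checkerboard (1s on one diagonal, 0s on the other).
        i, j = rng.integers(X.shape[0], size=2)
        k, l = rng.integers(X.shape[1], size=2)
        if X[i, k] == X[j, l] == 1 and X[i, l] == X[j, k] == 0:
            X[i, k] = X[j, l] = 0
            X[i, l] = X[j, k] = 1
            new_energy = w * abs(stat_fn(X) - target)
            # Metropolis acceptance: always accept a non-increase,
            # otherwise accept with probability exp(-delta energy).
            if new_energy <= energy or rng.random() < np.exp(energy - new_energy):
                energy = new_energy
            else:
                # Reject: revert the swap.
                X[i, k] = X[j, l] = 1
                X[i, l] = X[j, k] = 0
    return X
```

Running this with, say, `stat_fn = lambda M: float(M[:, :5].sum())` yields random matrices with the same row and column margins as the input and a nearly unchanged value of the constrained statistic, which is the sense in which the samples "satisfy the already discovered patterns."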
Related papers
- Personalized Federated Learning via Active Sampling [50.456464838807115]
This paper proposes a novel method for sequentially identifying similar (or relevant) data generators.
Our method evaluates the relevance of a data generator by evaluating the effect of a gradient step using its local dataset.
We extend this method to non-parametric models by a suitable generalization of the gradient step to update a hypothesis using the local dataset provided by a data generator.
arXiv Detail & Related papers (2024-09-03T17:12:21Z) - Learning to Bound Counterfactual Inference in Structural Causal Models
from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z) - Inv-SENnet: Invariant Self Expression Network for clustering under
biased data [17.25929452126843]
We propose a novel framework for jointly removing unwanted attributes (biases) while learning to cluster data points in individual subspaces.
Our experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-11-13T01:19:06Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - Combining Observational and Randomized Data for Estimating Heterogeneous
Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z) - A Robust and Flexible EM Algorithm for Mixtures of Elliptical
Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z) - Bayesian data combination model with Gaussian process latent variable
model for mixed observed variables under NMAR missingness [0.0]
It is often difficult to obtain a "(quasi) single-source dataset" in which the variables of interest are simultaneously observed.
It is therefore necessary to combine multiple datasets and treat them as a single-source dataset with missing variables.
We propose a data fusion method that does not assume that the datasets are homogeneous.
arXiv Detail & Related papers (2021-09-01T16:09:55Z) - The UU-test for Statistical Modeling of Unimodal Data [0.20305676256390928]
We propose a technique called UU-test (Unimodal Uniform test) to decide on the unimodality of a one-dimensional dataset.
A unique feature of this approach is that in the case of unimodality, it also provides a statistical model of the data in the form of a Uniform Mixture Model.
arXiv Detail & Related papers (2020-08-28T08:34:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.