A graph-structured distance for heterogeneous datasets with meta variables
- URL: http://arxiv.org/abs/2405.13073v1
- Date: Mon, 20 May 2024 23:11:03 GMT
- Title: A graph-structured distance for heterogeneous datasets with meta variables
- Authors: Edward Hallé-Hannan, Charles Audet, Youssef Diouane, Sébastien Le Digabel, Paul Saves,
- Abstract summary: Heterogeneous datasets emerge in various machine learning or optimization applications.
The first main contribution is a modeling graph-structured framework that generalizes state-of-the-art hierarchical, tree-structured, or variable-size frameworks.
The second main contribution is the graph-structured distance that compares extended points with any combination of included and excluded variables.
- Score: 1.677718351174347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Heterogeneous datasets emerge in various machine learning or optimization applications that feature different data sources, various data types and complex relationships between variables. In practice, heterogeneous datasets are often partitioned into smaller well-behaved ones that are easier to process. However, some applications involve expensive-to-generate or limited size datasets, which motivates methods based on the whole dataset. The first main contribution of this work is a modeling graph-structured framework that generalizes state-of-the-art hierarchical, tree-structured, or variable-size frameworks. This framework models domains that involve heterogeneous datasets in which variables may be continuous, integer, or categorical, with some identified as meta if their values determine the inclusion/exclusion or affect the bounds of other so-called decreed variables. Excluded variables are introduced to manage variables that are either included or excluded depending on the given points. The second main contribution is the graph-structured distance that compares extended points with any combination of included and excluded variables: any pair of points can be compared, allowing to work directly in heterogeneous datasets with meta variables. The contributions are illustrated with some regression experiments, in which the performance of a multilayer perceptron with respect to its hyperparameters is modeled with inverse distance weighting and $K$-nearest neighbors models.
Related papers
- Learning High-Dimensional Differential Graphs From Multi-Attribute Data [12.94486861344922]
We consider the problem of estimating differences in two Gaussian graphical models (GGMs) which are known to have similar structure.
Existing methods for differential graph estimation are based on single-attribute (SA) models.
In this paper, we analyze a group lasso penalized D-trace loss function approach for differential graph learning from multi-attribute data.
arXiv Detail & Related papers (2023-12-05T18:54:46Z) - Latent Variable Multi-output Gaussian Processes for Hierarchical
Datasets [0.8057006406834466]
Multi-output Gaussian processes (MOGPs) have been introduced to deal with multiple tasks by exploiting the correlations between different outputs.
This paper proposes an extension of MOGPs for hierarchical datasets.
arXiv Detail & Related papers (2023-08-31T15:52:35Z) - Heterogeneous Multi-Task Gaussian Cox Processes [61.67344039414193]
We present a novel extension of multi-task Gaussian Cox processes for modeling heterogeneous correlated tasks jointly.
A MOGP prior over the parameters of the dedicated likelihoods for classification, regression and point process tasks can facilitate sharing of information between heterogeneous tasks.
We derive a mean-field approximation to realize closed-form iterative updates for estimating model parameters.
arXiv Detail & Related papers (2023-08-29T15:01:01Z) - iSCAN: Identifying Causal Mechanism Shifts among Nonlinear Additive
Noise Models [48.33685559041322]
This paper focuses on identifying the causal mechanism shifts in two or more related datasets over the same set of variables.
Code implementing the proposed method is open-source and publicly available at https://github.com/kevinsbello/iSCAN.
arXiv Detail & Related papers (2023-06-30T01:48:11Z) - High-Dimensional Undirected Graphical Models for Arbitrary Mixed Data [2.2871867623460207]
In many applications data span variables of different types, whose principled joint analysis is nontrivial.
Recent advances have shown how the binary-continuous case can be tackled, but the general mixed variable type regime remains challenging.
We propose flexible and scalable methodology for data with variables of entirely general mixed type.
arXiv Detail & Related papers (2022-11-21T18:21:31Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - Inference of Multiscale Gaussian Graphical Model [0.0]
We propose a new method allowing to simultaneously infer a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy.
Results on real and synthetic data are presented.
arXiv Detail & Related papers (2022-02-11T17:11:20Z) - BCDAG: An R package for Bayesian structure and Causal learning of
Gaussian DAGs [77.34726150561087]
We introduce the R package for causal discovery and causal effect estimation from observational data.
Our implementation scales efficiently with the number of observations and, whenever the DAGs are sufficiently sparse, the number of variables in the dataset.
We then illustrate the main functions and algorithms on both real and simulated datasets.
arXiv Detail & Related papers (2022-01-28T09:30:32Z) - MURAL: An Unsupervised Random Forest-Based Embedding for Electronic
Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z) - For high-dimensional hierarchical models, consider exchangeability of
effects across covariates instead of across datasets [18.74167116981788]
We show that standard practice exhibits poor statistical performance when the number of covariates exceeds the number of datasets.
In statistical genetics, we might regress dozens of traits (defining datasets) for thousands of individuals (responses) on up to millions of genetic variants.
We propose a hierarchical model expressing our alternative perspective.
arXiv Detail & Related papers (2021-07-13T23:23:06Z) - Generalized Matrix Factorization: efficient algorithms for fitting
generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.