$\texttt{causalAssembly}$: Generating Realistic Production Data for
Benchmarking Causal Discovery
- URL: http://arxiv.org/abs/2306.10816v2
- Date: Wed, 14 Feb 2024 17:45:54 GMT
- Title: $\texttt{causalAssembly}$: Generating Realistic Production Data for
Benchmarking Causal Discovery
- Authors: Konstantin G\"obler, Tobias Windisch, Mathias Drton, Tim Pychynski,
Steffen Sonntag, Martin Roth
- Abstract summary: We build a system for generation of semisynthetic manufacturing data that supports benchmarking of causal discovery methods.
We employ distributional random forests to flexibly estimate and represent conditional distributions.
Using the library, we showcase how to benchmark several well-known causal discovery algorithms.
- Score: 1.3048920509133808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Algorithms for causal discovery have recently undergone rapid advances and
increasingly draw on flexible nonparametric methods to process complex data.
With these advances comes a need for adequate empirical validation of the
causal relationships learned by different algorithms. However, for most real
data sources true causal relations remain unknown. This issue is further
compounded by privacy concerns surrounding the release of suitable high-quality
data. To help address these challenges, we gather a complex dataset comprising
measurements from an assembly line in a manufacturing context. This line
consists of numerous physical processes for which we are able to provide ground
truth causal relationships on the basis of a detailed study of the underlying
physics. We use the assembly line data and associated ground truth information
to build a system for generation of semisynthetic manufacturing data that
supports benchmarking of causal discovery methods. To accomplish this, we
employ distributional random forests in order to flexibly estimate and
represent conditional distributions that may be combined into joint
distributions that strictly adhere to a causal model over the observed
variables. The estimated conditionals and tools for data generation are made
available in our Python library $\texttt{causalAssembly}$. Using the library,
we showcase how to benchmark several well-known causal discovery algorithms.
Related papers
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - AcceleratedLiNGAM: Learning Causal DAGs at the speed of GPUs [57.12929098407975]
We show that by efficiently parallelizing existing causal discovery methods, we can scale them to thousands of dimensions.
Specifically, we focus on the causal ordering subprocedure in DirectLiNGAM and implement GPU kernels to accelerate it.
This allows us to apply DirectLiNGAM to causal inference on large-scale gene expression data with genetic interventions yielding competitive results.
arXiv Detail & Related papers (2024-03-06T15:06:11Z) - Federated Causal Discovery from Heterogeneous Data [70.31070224690399]
We propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data.
These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy.
We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method.
arXiv Detail & Related papers (2024-02-20T18:53:53Z) - Discovering Mixtures of Structural Causal Models from Time Series Data [23.18511951330646]
We propose a general variational inference-based framework called MCD to infer the underlying causal models.
Our approach employs an end-to-end training process that maximizes an evidence-lower bound for the data likelihood.
We demonstrate that our method surpasses state-of-the-art benchmarks in causal discovery tasks.
arXiv Detail & Related papers (2023-10-10T05:13:10Z) - Salesforce CausalAI Library: A Fast and Scalable Framework for Causal
Analysis of Time Series and Tabular Data [76.85310770921876]
We introduce the Salesforce CausalAI Library, an open-source library for causal analysis using observational data.
The goal of this library is to provide a fast and flexible solution for a variety of problems in the domain of causality.
arXiv Detail & Related papers (2023-01-25T22:42:48Z) - Boosting Synthetic Data Generation with Effective Nonlinear Causal
Discovery [11.81479419498206]
In software testing, data privacy, imbalanced learning, and artificial intelligence explanation, it is crucial to generate plausible data samples.
A common assumption of approaches widely used for data generation is the independence of the features.
We propose a synthetic dataset generator that can discover nonlinear causalities among the variables and use them at generation time.
arXiv Detail & Related papers (2023-01-18T10:54:06Z) - Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z) - Federated Causal Discovery [74.37739054932733]
This paper develops a gradient-based learning framework named DAG-Shared Federated Causal Discovery (DS-FCD)
It can learn the causal graph without directly touching local data and naturally handle the data heterogeneity.
Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.
arXiv Detail & Related papers (2021-12-07T08:04:12Z) - Iterative Rule Extension for Logic Analysis of Data: an MILP-based
heuristic to derive interpretable binary classification from large datasets [0.6526824510982799]
This work presents IRELAND, an algorithm that allows for abstracting Boolean phrases in DNF from data with up to 10,000 samples and sample characteristics.
The results show that for large datasets IRELAND outperforms the current state-of-the-art and can find solutions for datasets where current models run out of memory or need excessive runtimes.
arXiv Detail & Related papers (2021-10-25T13:31:30Z) - Causal-TGAN: Generating Tabular Data Using Causal Generative Adversarial
Networks [7.232789848964222]
We propose a causal model named Causal Tabular Generative Neural Network (Causal-TGAN) to generate synthetic data.
Experiments on both simulated datasets and real datasets demonstrate the better performance of our method.
arXiv Detail & Related papers (2021-04-21T17:59:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.