Double Machine Learning for Adaptive Causal Representation in High-Dimensional Data
- URL: http://arxiv.org/abs/2411.14665v1
- Date: Fri, 22 Nov 2024 01:54:53 GMT
- Title: Double Machine Learning for Adaptive Causal Representation in High-Dimensional Data
- Authors: Lynda Aouar, Han Yu
- Abstract summary: Support points sample splitting (SPSS) is employed for efficient double machine learning (DML) in causal inference.
The support points are selected and split as optimal representative points of the full raw data in a random sample.
They offer the best representation of the full dataset, whereas traditional random data splitting is unlikely to preserve the structural information of the underlying distribution.
- Score: 14.25379577156518
- Abstract: Adaptive causal representation learning from observational data is presented, integrated with an efficient sample splitting technique within the semiparametric estimating equation framework. Support points sample splitting (SPSS), a subsampling method based on energy distance, is employed for efficient double machine learning (DML) in causal inference. The support points are selected and split as optimal representative points of the full raw data in a random sample, in contrast to traditional random splitting, providing an optimal sub-representation of the underlying data-generating distribution. They offer the best representation of a full large dataset, whereas traditional random data splitting is unlikely to preserve the structural information of the underlying distribution. Three machine learning estimators were adopted for causal inference with SPSS: support vector machine (SVM), deep learning (DL), and a hybrid of super learner (SL) with deep learning (SDL). A comparative study is conducted between the proposed SVM, DL, and SDL representations using SPSS and the benchmark results of Chernozhukov et al. (2018), which employed random forest, neural network, and regression trees with random k-fold cross-fitting on the 401(k) pension plan real data. The simulations show that DL with SPSS and the hybrid of DL and SL with SPSS outperform SVM with SPSS in computational efficiency and estimation quality, respectively.
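To make the pipeline concrete, below is a minimal Python sketch of the two ingredients the abstract describes: selecting representative points by (approximately) minimizing an energy-distance criterion, and using the resulting split for cross-fitted DML in a partially linear model. This is an illustration under our own simplifying assumptions, not the authors' implementation: the greedy selection is a crude stand-in for a proper support-points optimizer, random forests replace the paper's SVM/DL/SDL nuisance learners, and the data are simulated.

```python
# A minimal sketch, not the paper's implementation. Assumptions: a partially
# linear model Y = theta*D + g(X) + noise, a naive greedy energy-distance
# minimizer as a stand-in for a proper support-points solver, and random-forest
# nuisance learners in place of the paper's SVM/DL/SDL estimators.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.ensemble import RandomForestRegressor

def energy_criterion(points, data):
    # Energy distance up to the constant E||x - x'|| term, which does not
    # depend on the candidate points: 2*E||p - x|| - E||p - p'||.
    return 2.0 * cdist(points, data).mean() - cdist(points, points).mean()

def support_point_indices(X, n_points, rng, cands_per_step=25):
    """Greedily pick rows of X that most reduce the energy criterion
    (a crude stand-in for an actual support-points optimizer)."""
    chosen, pool = [], list(range(len(X)))
    for _ in range(n_points):
        cands = rng.choice(pool, size=min(cands_per_step, len(pool)), replace=False)
        best_i = min(cands, key=lambda i: energy_criterion(X[chosen + [i]], X))
        chosen.append(int(best_i))
        pool.remove(int(best_i))
    return np.array(chosen)

def dml_plm(Y, D, X, fold_a, fold_b):
    """Cross-fitted DML estimate of theta in Y = theta*D + g(X) + noise."""
    thetas = []
    for train, test in [(fold_a, fold_b), (fold_b, fold_a)]:
        m_hat = RandomForestRegressor(n_estimators=100).fit(X[train], D[train])  # E[D|X]
        g_hat = RandomForestRegressor(n_estimators=100).fit(X[train], Y[train])  # E[Y|X]
        v = D[test] - m_hat.predict(X[test])  # treatment residual
        u = Y[test] - g_hat.predict(X[test])  # outcome residual
        thetas.append(np.sum(v * u) / np.sum(v * v))  # residual-on-residual slope
    return float(np.mean(thetas))

# Simulated example with true effect theta = 0.5.
rng = np.random.default_rng(0)
n, d = 400, 5
X = rng.normal(size=(n, d))
D = X[:, 0] + rng.normal(size=n)
Y = 0.5 * D + np.sin(X[:, 1]) + rng.normal(size=n)

fold_a = support_point_indices(X, n_points=n // 2, rng=rng)  # representative half
fold_b = np.setdiff1d(np.arange(n), fold_a)                  # its complement
print(f"DML estimate of theta: {dml_plm(Y, D, X, fold_a, fold_b):.3f}")
```

The cross-fitting step mirrors standard DML: nuisance functions are fit on one fold and residualized on the other, then the roles are swapped and the two estimates averaged; SPSS only changes how the folds are formed.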
Related papers
- On Pretraining Data Diversity for Self-Supervised Learning [57.91495006862553]
We explore the impact of training with more diverse datasets on the performance of self-supervised learning (SSL) under a fixed computational budget.
Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal.
arXiv Detail & Related papers (2024-03-20T17:59:58Z)
- Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
- The Common Stability Mechanism behind most Self-Supervised Learning Approaches [64.40701218561921]
We provide a framework to explain the stability mechanism of different self-supervised learning techniques.
We discuss the working mechanism of contrastive techniques like SimCLR, non-contrastive techniques like BYOL, SWAV, SimSiam, Barlow Twins, and DINO.
We formulate different hypotheses and test them using the Imagenet100 dataset.
arXiv Detail & Related papers (2024-02-22T20:36:24Z)
- Soft Random Sampling: A Theoretical and Empirical Analysis [59.719035355483875]
Soft random sampling (SRS) is a simple yet effective approach for efficient deep neural networks when dealing with massive data.
It selects a subset uniformly at random, with replacement, from the full data set in each epoch (a minimal sketch of this selection step appears after this list).
It is shown to be a powerful and competitive strategy with significant performance on real-world, industrial-scale applications.
arXiv Detail & Related papers (2023-11-21T17:03:21Z)
- Separability and Scatteredness (S&S) Ratio-Based Efficient SVM Regularization Parameter, Kernel, and Kernel Parameter Selection [10.66048003460524]
Support Vector Machine (SVM) is a robust machine learning algorithm with broad applications in classification, regression, and outlier detection.
This work shows that the SVM performance can be modeled as a function of separability and scatteredness (S&S) of the data.
arXiv Detail & Related papers (2023-05-17T13:51:43Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Approximate Thompson Sampling via Epistemic Neural Networks [26.872304174606278]
Epistemic neural networks (ENNs) are designed to produce accurate joint predictive distributions.
We show that ENNs serve this purpose well and illustrate how the quality of joint predictive distributions drives performance.
arXiv Detail & Related papers (2023-02-18T01:58:15Z)
- Linking data separation, visual separation, and classifier performance using pseudo-labeling by contrastive learning [125.99533416395765]
We argue that the performance of the final classifier depends on the data separation present in the latent space and visual separation present in the projection.
We demonstrate our results by the classification of five real-world challenging image datasets of human intestinal parasites with only 1% supervised samples.
arXiv Detail & Related papers (2023-02-06T10:01:38Z)
- Convergence Analysis of Sequential Split Learning on Heterogeneous Data [6.937859054591121]
Split Learning (SL) and Federated Averaging (FedAvg) are two popular paradigms in distributed machine learning.
We derive convergence guarantees for sequential SL and FedAvg on heterogeneous data.
We validate the counterintuitive analysis result empirically on extremely heterogeneous data.
arXiv Detail & Related papers (2023-02-03T10:04:44Z)
- Distributed Learning of Generalized Linear Causal Networks [19.381934612280993]
We propose a novel structure learning method called distributed annealing on regularized likelihood score (DARLS).
DARLS is the first method for learning causal graphs with such theoretical guarantees.
In a real-world application for modeling protein-DNA binding networks with distributed ChIP-Sequencing data, DARLS exhibits higher predictive power than other methods.
arXiv Detail & Related papers (2022-01-23T06:33:25Z)
- SHRIMP: Sparser Random Feature Models via Iterative Magnitude Pruning [3.775565013663731]
We propose a new method -- Sparser Random Feature Models via IMP (ShRIMP) -- to efficiently fit high-dimensional data with inherent low-dimensional structure.
Our method can be viewed as a combined process to construct and find sparse lottery tickets for two-layer dense networks.
arXiv Detail & Related papers (2021-12-07T21:32:28Z)
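Since the Soft Random Sampling entry above pins down a concrete selection rule, here is a minimal sketch of that per-epoch step under our own assumptions; the function name, the rate parameter `r`, and the interface are illustrative, not the paper's API.

```python
# A minimal sketch of the per-epoch selection step described in the Soft Random
# Sampling entry above: draw a fraction r of the data uniformly at random with
# replacement each epoch. Names and interface are our assumptions.
import numpy as np

def soft_random_sample(n_data: int, r: float, rng: np.random.Generator) -> np.ndarray:
    """Return indices for one epoch: floor(r * n_data) uniform draws, with replacement."""
    return rng.integers(0, n_data, size=int(r * n_data))

rng = np.random.default_rng(0)
for epoch in range(3):
    idx = soft_random_sample(n_data=10, r=0.8, rng=rng)
    print(f"epoch {epoch}: {sorted(idx.tolist())}")  # duplicates possible; some points skipped
```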