Estimation of conditional average treatment effects on distributed confidential data
- URL: http://arxiv.org/abs/2402.02672v5
- Date: Fri, 25 Jul 2025 00:50:45 GMT
- Title: Estimation of conditional average treatment effects on distributed confidential data
- Authors: Yuji Kawamata, Ryoki Motai, Yukihiko Okada, Akira Imakura, Tetsuya Sakurai,
- Abstract summary: Conditional average treatment effects (CATEs) can be estimated with high accuracy if data distributed across multiple parties are centralized.<n>It is difficult to aggregate such data owing to confidentiality or privacy concerns.<n>We propose data collaboration double machine learning, a method for estimating CATE models using privacy-preserving fusion data constructed from distributed sources.
- Score: 6.798254568821052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The estimation of conditional average treatment effects (CATEs) is an important topic in many scientific fields. CATEs can be estimated with high accuracy if data distributed across multiple parties are centralized. However, it is difficult to aggregate such data owing to confidentiality or privacy concerns. To address this issue, we propose data collaboration double machine learning, a method for estimating CATE models using privacy-preserving fusion data constructed from distributed sources, and evaluate its performance through simulations. We make three main contributions. First, our method enables estimation and testing of semi-parametric CATE models without iterative communication on distributed data, providing robustness to model mis-specification compared to parametric approaches. Second, it enables collaborative estimation across different time points and parties by accumulating a knowledge base. Third, our method performs as well as or better than existing methods in simulations using synthetic, semi-synthetic, and real-world datasets.
Related papers
- DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.<n>Our framework, textttDUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.<n>Specifically, given the evaluated data utilities of some data subsets, textttDUPRE fits a emphGaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z) - Bayesian Surrogate Training on Multiple Data Sources: A Hybrid Modeling Strategy [1.2435663633224636]
We propose two novel approaches to integrate simulation data and real-world measurement data during surrogate training.<n>The first method trains separate surrogate models for each data source and combines their predictive distributions, while the second incorporates both data sources by training a single surrogate.
arXiv Detail & Related papers (2024-12-16T15:27:28Z) - Ranking and Combining Latent Structured Predictive Scores without Labeled Data [2.5064967708371553]
This paper introduces a novel structured unsupervised ensemble learning model (SUEL)
It exploits the dependency between a set of predictors with continuous predictive scores, rank the predictors without labeled data and combine them to an ensembled score with weights.
The efficacy of the proposed methods is rigorously assessed through both simulation studies and real-world application of risk genes discovery.
arXiv Detail & Related papers (2024-08-14T20:14:42Z) - Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z) - Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC)
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z) - Counterfactual Data Augmentation with Contrastive Learning [27.28511396131235]
We introduce a model-agnostic data augmentation method that imputes the counterfactual outcomes for a selected subset of individuals.
We use contrastive learning to learn a representation space and a similarity measure such that in the learned representation space close individuals identified by the learned similarity measure have similar potential outcomes.
This property ensures reliable imputation of counterfactual outcomes for the individuals with close neighbors from the alternative treatment group.
arXiv Detail & Related papers (2023-11-07T00:36:51Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Meta-learning for heterogeneous treatment effect estimation with
closed-form solvers [30.343569752920754]
This article proposes a meta-learning method for estimating the conditional average treatment effect (CATE) from a few observational data.
The proposed method learns how to estimate CATEs from multiple tasks and uses the knowledge for unseen tasks.
arXiv Detail & Related papers (2023-05-19T00:07:38Z) - Data-SUITE: Data-centric identification of in-distribution incongruous
examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z) - Evaluating Causal Inference Methods [0.4588028371034407]
We introduce a deep generative model-based framework, Credence, to validate causal inference methods.
Our work introduces a deep generative model-based framework, Credence, to validate causal inference methods.
arXiv Detail & Related papers (2022-02-09T00:21:22Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Noise-Resistant Deep Metric Learning with Probabilistic Instance
Filtering [59.286567680389766]
Noisy labels are commonly found in real-world data, which cause performance degradation of deep neural networks.
We propose Probabilistic Ranking-based Instance Selection with Memory (PRISM) approach for DML.
PRISM calculates the probability of a label being clean, and filters out potentially noisy samples.
arXiv Detail & Related papers (2021-08-03T12:15:25Z) - MINIMALIST: Mutual INformatIon Maximization for Amortized Likelihood
Inference from Sampled Trajectories [61.3299263929289]
Simulation-based inference enables learning the parameters of a model even when its likelihood cannot be computed in practice.
One class of methods uses data simulated with different parameters to infer an amortized estimator for the likelihood-to-evidence ratio.
We show that this approach can be formulated in terms of mutual information between model parameters and simulated data.
arXiv Detail & Related papers (2021-06-03T12:59:16Z) - Federated Estimation of Causal Effects from Observational Data [19.657789891394504]
We present a novel framework for causal inference with federated data sources.
We assess and integrate local causal effects from different private data sources without centralizing them.
arXiv Detail & Related papers (2021-05-31T08:06:00Z) - A similarity-based Bayesian mixture-of-experts model [0.5156484100374058]
We present a new non-parametric mixture-of-experts model for multivariate regression problems.
Using a conditionally specified model, predictions for out-of-sample inputs are based on similarities to each observed data point.
Posterior inference is performed on the parameters of the mixture as well as the distance metric.
arXiv Detail & Related papers (2020-12-03T18:08:30Z) - Distributed Learning of Finite Gaussian Mixtures [21.652015112462]
We study split-and-conquer approaches for the distributed learning of finite Gaussian mixtures.
New estimator is shown to be consistent and retains root-n consistency under some general conditions.
Experiments based on simulated and real-world data show that the proposed split-and-conquer approach has comparable statistical performance with the global estimator.
arXiv Detail & Related papers (2020-10-20T16:17:47Z) - Learning while Respecting Privacy and Robustness to Distributional
Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z) - Machine learning for causal inference: on the use of cross-fit
estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties.
We conducted a simulation study to assess the performance of several estimators for the average causal effect (ACE)
When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
arXiv Detail & Related papers (2020-04-21T23:09:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.