High Dimensional Data Enrichment: Interpretable, Fast, and
Data-Efficient
- URL: http://arxiv.org/abs/1806.04047v4
- Date: Fri, 30 Jun 2023 06:38:41 GMT
- Title: High Dimensional Data Enrichment: Interpretable, Fast, and
Data-Efficient
- Authors: Amir Asiaee, Samet Oymak, Kevin R. Coombes, Arindam Banerjee
- Abstract summary: We introduce an estimator for the problem of multiple connected linear regressions known as Data Enrichment/Sharing.
We show that the recovery of the common parameter benefits from \emph{all} of the pooled samples.
Overall, we present the first thorough statistical and computational analysis of inference in the data-sharing model.
- Score: 38.40316295019222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of multi-task learning in the high dimensional
setting. In particular, we introduce an estimator and investigate its
statistical and computational properties for the problem of multiple connected
linear regressions known as Data Enrichment/Sharing. The between-tasks
connections are captured by a cross-tasks \emph{common parameter}, which gets
refined by per-task \emph{individual parameters}. Any convex function, e.g.,
norm, can characterize the structure of both common and individual parameters.
We delineate the sample complexity of our estimator and provide a high
probability non-asymptotic bound for estimation error of all parameters under a
geometric condition. We show that the recovery of the common parameter benefits
from \emph{all} of the pooled samples. We propose an iterative estimation
algorithm with a geometric convergence rate and supplement our theoretical
analysis with experiments on synthetic data. Overall, we present the first
thorough statistical and computational analysis of inference in the
data-sharing model.
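The model and algorithm can be made concrete. Below is a minimal synthetic sketch, assuming $\ell_1$ (sparsity) structure for both the common and individual parameters and a plain alternating proximal-gradient loop; the variable names, step size, and regularization level are illustrative, not taken from the paper.
```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the l1 norm (the assumed sparsity structure).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
G, n, p = 5, 100, 50                      # tasks, samples per task, dimension

beta_common = np.zeros(p); beta_common[:5] = 1.0
betas_ind = [np.zeros(p) for _ in range(G)]
for g in range(G):
    betas_ind[g][5 + g] = 0.5             # each task refines the common part

X = [rng.standard_normal((n, p)) for _ in range(G)]
y = [X[g] @ (beta_common + betas_ind[g]) + 0.1 * rng.standard_normal(n)
     for g in range(G)]

# Alternating proximal gradient: the common parameter is updated from the
# pooled residuals of all G tasks; each individual parameter from its own task.
b0 = np.zeros(p)
b = [np.zeros(p) for _ in range(G)]
step, lam = 0.2, 0.02
for _ in range(500):
    grad0 = sum(X[g].T @ (X[g] @ (b0 + b[g]) - y[g]) for g in range(G)) / (G * n)
    b0 = soft_threshold(b0 - step * grad0, step * lam)
    for g in range(G):
        grad_g = X[g].T @ (X[g] @ (b0 + b[g]) - y[g]) / n
        b[g] = soft_threshold(b[g] - step * grad_g, step * lam)

print("common parameter error:", np.linalg.norm(b0 - beta_common))
```
The common update pools all G*n samples, which is the mechanism behind the "recovery of the common parameter benefits from all of the pooled samples" claim above.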
Related papers
- Large Dimensional Independent Component Analysis: Statistical Optimality
and Computational Tractability [13.104413212606577]
We investigate the optimal statistical performance and the impact of computational constraints for independent component analysis (ICA).
We show that the optimal sample complexity is linear in dimensionality.
We develop computationally tractable estimates that attain both the optimal sample complexity and minimax optimal rates of convergence.
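The ICA setting itself is easy to demo. The following is not the estimator analyzed in the paper, just scikit-learn's FastICA recovering two non-Gaussian sources from linear mixtures:
```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 2000
# Two independent non-Gaussian sources (non-Gaussianity is what ICA exploits).
s = np.c_[rng.laplace(size=n), rng.uniform(-1, 1, size=n)]
A = np.array([[1.0, 0.5], [0.3, 1.0]])    # unknown mixing matrix
x = s @ A.T                               # observed mixtures

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x)              # sources recovered up to order/scale
```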
arXiv Detail & Related papers (2023-03-31T15:46:30Z)
- Kernel-based off-policy estimation without overlap: Instance optimality
beyond semiparametric efficiency [53.90687548731265]
We study optimal procedures for estimating a linear functional based on observational data.
For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
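As a toy illustration of the setting only (not the paper's minimax-optimal procedure): a plug-in estimate of the linear functional E[Y(1)] from observational data, using a kernel ridge fit of the outcome regression, which avoids inverse-propensity weights entirely. The data-generating process below is made up.
```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-1, 1, size=(n, 1))
A = rng.binomial(1, 0.3 + 0.4 * (X[:, 0] > 0))   # logging/treatment mechanism
Y = np.sin(3 * X[:, 0]) + A * (1 + X[:, 0]) + 0.1 * rng.standard_normal(n)

# Plug-in estimate of E[mu_1(X)], mu_1(x) = E[Y | X=x, A=1], fitted on the
# treated subsample only; no inverse-propensity weights are formed.
mu1 = KernelRidge(kernel="rbf", alpha=1e-2, gamma=10.0)
mu1.fit(X[A == 1], Y[A == 1])
print("plug-in E[Y(1)] estimate:", mu1.predict(X).mean())
```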
arXiv Detail & Related papers (2023-01-16T02:57:37Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models
from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the unidentifiability region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
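The flavor of interval-valued counterfactual inference can be shown in a tiny brute-force example (not the paper's EM algorithm): for a binary exogenous X -> Y model, scan the latent "response type" distributions consistent with the observed conditionals and collect the reachable values of a counterfactual query.
```python
import numpy as np

# Toy SCM: binary X -> Y with exogenous latent U.  Y = f(X, U) falls into four
# response types: always-0, always-1, Y = X, Y = 1 - X.
p1, p0 = 0.7, 0.4          # observed P(Y=1|X=1), P(Y=1|X=0)

vals = []
# Feasible mass q_y1 of the always-1 type, given the two observed conditionals.
for q_y1 in np.linspace(max(0.0, p1 + p0 - 1.0), min(p1, p0), 1001):
    q_id = p1 - q_y1        # mass of the Y = X type
    # Counterfactual query: P(Y would be 0 under X=0 | observed X=1, Y=1).
    vals.append(q_id / (q_y1 + q_id))
print("interval approximation:", min(vals), max(vals))
```
If the observed data pinned down the type distribution exactly, the scanned set would be a single point, mirroring the collapse to points in the identifiable case.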
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- Semi-Supervised Quantile Estimation: Robust and Efficient Inference in High Dimensional Settings [0.5735035463793009]
We consider quantile estimation in a semi-supervised setting, characterized by two available data sets.
We propose a family of semi-supervised estimators for the response quantile(s) based on the two data sets.
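A generic impute-and-debias construction in this spirit looks as follows (an assumed illustration, not necessarily a member of the authors' family of estimators): take the quantile of imputed responses on the large unlabeled set, recentered by the labeled-set gap between true and imputed quantiles.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_lab, n_unlab, tau = 200, 5000, 0.5      # small labeled set, big unlabeled set

X_lab = rng.standard_normal((n_lab, 3))
y_lab = X_lab @ np.array([1.0, -0.5, 0.2]) + rng.standard_normal(n_lab)
X_unlab = rng.standard_normal((n_unlab, 3))   # responses unobserved here

reg = LinearRegression().fit(X_lab, y_lab)

# Quantile of imputed responses on the unlabeled set, plus a debiasing term
# estimated on the labeled set.
q_imputed = np.quantile(reg.predict(X_unlab), tau)
bias = np.quantile(y_lab, tau) - np.quantile(reg.predict(X_lab), tau)
print("semi-supervised quantile estimate:", q_imputed + bias)
```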
arXiv Detail & Related papers (2022-01-25T10:02:23Z)
- Gaining Outlier Resistance with Progressive Quantiles: Fast Algorithms
and Theoretical Studies [1.6457778420360534]
A framework of outlier-resistant estimation is introduced to robustify arbitrarily given loss functions.
A new technique is proposed to alleviate the requirement on the starting point, so that on regular datasets the number of data reestimations can be substantially reduced.
The obtained estimators, though not necessarily globally or even locally optimal, enjoy minimax rate optimality in both low and high dimensions.
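A bare-bones relative of this idea is iteratively trimmed least squares: refit on the observations with the smallest current residuals. The sketch below is not the paper's progressive-quantile algorithm, but it conveys the mechanism.
```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_out = 200, 5, 20
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
y = X @ beta + 0.1 * rng.standard_normal(n)
y[:n_out] += 10.0                       # gross outliers in the responses

h = int(0.8 * n)                        # number of points trusted per round
b = np.linalg.lstsq(X, y, rcond=None)[0]
for _ in range(20):
    resid = np.abs(y - X @ b)
    keep = np.argsort(resid)[:h]        # refit on the h smallest residuals
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
print("trimmed estimate error:", np.linalg.norm(b - beta))
```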
arXiv Detail & Related papers (2021-12-15T20:35:21Z)
- Post-mortem on a deep learning contest: a Simpson's paradox and the
complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox: "scale" metrics perform well overall but poorly on subpartitions of the data.
We present two novel shape metrics, one data-independent, and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
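For intuition about the two kinds of metrics (a made-up example, not the contest pipeline or the paper's exact metrics): a scale metric such as the spectral norm of a layer, versus a crude shape metric such as a Hill estimate of the eigenvalue tail exponent of W^T W.
```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256)) / np.sqrt(256)   # stand-in weight matrix

# Scale metric: spectral norm (largest singular value) of the layer.
scale_metric = np.linalg.norm(W, 2)

# Shape metric: Hill estimator of the tail exponent of the eigenvalue
# spectrum of W^T W (heavier tails correspond to smaller alpha).
eigs = np.sort(np.linalg.eigvalsh(W.T @ W))[::-1]
k = 50                                               # tail size
alpha = 1.0 + k / np.sum(np.log(eigs[:k] / eigs[k]))
print("scale:", scale_metric, "shape (alpha):", alpha)
```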
arXiv Detail & Related papers (2021-06-01T19:19:49Z)
- Doubly Robust Semiparametric Difference-in-Differences Estimators with
High-Dimensional Data [15.27393561231633]
We propose a doubly robust two-stage semiparametric difference-in-differences estimator for estimating heterogeneous treatment effects.
The first stage allows a general set of machine learning methods to be used to estimate the propensity score.
In the second stage, we derive the rates of convergence for both the parametric parameter and the unknown function.
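A compact version of the two-stage construction, with plain logistic and linear models standing in for the general first-stage machine learning methods (the data-generating process and the true ATT of 2 below are invented for illustration):
```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.standard_normal((n, 2))
pscore_true = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1])))
D = rng.binomial(1, pscore_true)                  # treated-group indicator
dY = 0.5 * X[:, 0] + 2.0 * D + rng.standard_normal(n)  # Y_post - Y_pre

# Stage 1: propensity score (any ML method could be plugged in here).
e = LogisticRegression().fit(X, D).predict_proba(X)[:, 1]
# Outcome-change regression, fitted on the comparison group only.
m0 = LinearRegression().fit(X[D == 0], dY[D == 0]).predict(X)

# Doubly robust ATT combination (Sant'Anna-Zhao style weighting).
w1 = D / D.mean()
w0 = e * (1 - D) / (1 - e) / (e * (1 - D) / (1 - e)).mean()
att = np.mean((w1 - w0) * (dY - m0))
print("DR DiD ATT estimate:", att)
```
The estimate stays consistent if either the propensity model or the outcome-change model is correct, which is the "doubly robust" property.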
arXiv Detail & Related papers (2020-09-07T15:14:29Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
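The ensemble itself is straightforward to reproduce; the paper's contribution is an analytic estimator of its misclassification probability, which this hedged sketch simply cross-checks against held-out error.
```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, p, d, n_proj = 400, 100, 10, 25   # samples, ambient dim, projected dim, ensemble size

X = rng.standard_normal((n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.5 * rng.standard_normal(n) > 0).astype(int)
X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]

votes = np.zeros(len(X_te))
for _ in range(n_proj):
    R = rng.standard_normal((p, d)) / np.sqrt(d)    # random projection
    clf = LinearDiscriminantAnalysis().fit(X_tr @ R, y_tr)
    votes += clf.predict(X_te @ R)
y_hat = (votes > n_proj / 2).astype(int)            # majority vote
print("held-out error:", np.mean(y_hat != y_te))
```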
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
- Semiparametric Nonlinear Bipartite Graph Representation Learning with
Provable Guarantees [106.91654068632882]
We consider the bipartite graph and formalize its representation learning problem as a statistical estimation problem of parameters in a semiparametric exponential family distribution.
We show that the proposed objective is strongly convex in a neighborhood around the ground truth, so that a gradient descent-based method achieves a linear convergence rate.
Our estimator is robust to any model misspecification within the exponential family, which is validated in extensive experiments.
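A minimal instance with a Bernoulli/logistic link (the paper treats general semiparametric exponential families): gradient descent on left and right node embeddings for a bipartite adjacency matrix. Dimensions, step size, and iteration count are illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 80, 60, 4                   # left nodes, right nodes, embedding dim

U_true = rng.standard_normal((n1, r)) / np.sqrt(r)
V_true = rng.standard_normal((n2, r)) / np.sqrt(r)
P = 1 / (1 + np.exp(-U_true @ V_true.T))
A = rng.binomial(1, P).astype(float)    # observed bipartite adjacency

# Gradient descent on the logistic log-likelihood over the embeddings.
U = 0.1 * rng.standard_normal((n1, r))
V = 0.1 * rng.standard_normal((n2, r))
step = 0.5
for _ in range(500):
    S = 1 / (1 + np.exp(-U @ V.T))      # model edge probabilities
    gU = (S - A) @ V / n2               # gradients of the per-entry loss
    gV = (S - A).T @ U / n1
    U, V = U - step * gU, V - step * gV

S = 1 / (1 + np.exp(-U @ V.T))
print("fit deviance:", -np.mean(A * np.log(S) + (1 - A) * np.log(1 - S)))
```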
arXiv Detail & Related papers (2020-03-02T16:40:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences arising from its use.